Machine Learning Strategy (Part 1)

Machine learning strategy is about how to tackle machine learning tasks strategically. In three blog posts I will try to give an introduction to this topic, and I also hope for some comments and opinions on it.

Goal Definition

The first step of a machine learning problem is usually to define the goal itself. What is the purpose, what do we want to achieve? Sometimes it is not so clear at the beginning.

After that, the next step is to think about how to achieve the goal in the best way. Questions that arise are:

Which and how much data should be collected or are available?
How should the data be structured, transformed and divided?
Which algorithm with which hyperparameters should be used?

How can these questions be answered?

Metrics definition

The first step is to clearly define the measures that should be optimized:

What metric(s) do we want to optimize? (optimizing metrics)
Under which constraints should they be optimized? (satisficing metrics)
Are there observations for which we should give a stronger weight?

Example: We want to minimize the mean squared error under the constraint that the runtime is less than 5 minutes.
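To make the difference between the two kinds of metrics concrete, here is a minimal sketch of how such a selection could look in Python. The names Candidate and select_model and all numbers are made up for illustration; they do not come from a particular library. The satisficing metric (runtime) only filters candidates, while the optimizing metric (mean squared error) ranks those that remain.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Candidate:
    name: str
    mse: float              # optimizing metric: lower is better
    runtime_minutes: float  # satisficing metric: only has to stay below a limit


def select_model(candidates: Sequence[Candidate],
                 max_runtime_minutes: float = 5.0) -> Optional[Candidate]:
    """Return the candidate with the lowest MSE among those under the runtime limit."""
    feasible = [c for c in candidates if c.runtime_minutes < max_runtime_minutes]
    if not feasible:
        return None  # no candidate satisfies the constraint
    return min(feasible, key=lambda c: c.mse)


# Made-up results, for illustration only
candidates = [
    Candidate("model_a", mse=0.80, runtime_minutes=1.0),
    Candidate("model_b", mse=0.50, runtime_minutes=4.0),
    Candidate("model_c", mse=0.30, runtime_minutes=12.0),
]

best = select_model(candidates)
print(best.name if best else "no candidate satisfies the constraint")
```

Here model_c has the lowest error but violates the runtime constraint, so model_b is selected.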

Data split

The next step is to divide the data into different parts. Usually the data is divided into three parts:

Training data: used for training the algorithm
Development data: used for iteratively evaluating the trained algorithm
Test data: used for the final evaluation of the trained algorithm
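A three-way split like this could be sketched as follows, using only the Python standard library. The function name three_way_split, the default fractions and the fixed seed are my own assumptions for illustration. The data is shuffled once before splitting, which assumes the observations are independent; time series or grouped data would need a different strategy.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def three_way_split(data: Sequence[T],
                    dev_fraction: float = 0.2,
                    test_fraction: float = 0.2,
                    seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle the data once and cut it into training, development and test parts."""
    if dev_fraction < 0 or test_fraction < 0 or dev_fraction + test_fraction >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    items = list(data)
    random.Random(seed).shuffle(items)  # fixed seed makes the split reproducible
    n_dev = round(len(items) * dev_fraction)
    n_test = round(len(items) * test_fraction)
    dev = items[:n_dev]
    test = items[n_dev:n_dev + n_test]
    train = items[n_dev + n_test:]      # everything that is left
    return train, dev, test


train, dev, test = three_way_split(range(1000))
print(len(train), len(dev), len(test))
```

The test part should then be put aside and only be touched for the final evaluation.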

With a resampling strategy such as (repeated) cross-validation, training and development data can be interchanged. For example, in 5-fold cross-validation the data is divided into 5 parts, and each part is used once as development data for evaluating the metric while the other 4 parts are used for training. At the end one can, for example, take the mean of the 5 development results.
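The cross-validation procedure can be sketched in a few lines of plain Python. This is a simplified illustration and not a replacement for a tested library implementation: the function names are my own, and the "model" is just a placeholder that predicts the mean of the training targets, so that the example stays self-contained.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple


def k_fold_indices(n: int, k: int = 5, seed: int = 0) -> List[List[int]]:
    """Shuffle the indices 0..n-1 and deal them into k folds of nearly equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate_mse(y: Sequence[float], k: int = 5,
                       seed: int = 0) -> Tuple[float, List[float]]:
    """Return the mean development MSE over k folds and the per-fold scores."""
    folds = k_fold_indices(len(y), k, seed)
    scores = []
    for i, dev_idx in enumerate(folds):
        # All folds except the i-th one form the training data
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        # Placeholder "model": always predict the mean of the training targets
        prediction = mean(y[j] for j in train_idx)
        # Evaluate the metric on the fold that was held out
        scores.append(mean((y[j] - prediction) ** 2 for j in dev_idx))
    return mean(scores), scores


# Made-up target values, for illustration only
y = [float(v) for v in range(20)]
mean_mse, fold_mses = cross_validate_mse(y, k=5)
print(mean_mse, fold_mses)
```

Every observation is used exactly once as development data, so the averaged result depends less on one lucky or unlucky split.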

How to divide the data?

Development and test data should represent the final data for which we train the algorithm.
Ideally, they should have the same probability distribution.

Data sizes:

Classical division: 60% training, 20% development, 20% test → This makes sense for small datasets (100 to 100 000 observations) that still leave enough data for development and test.
For larger datasets (e.g. 1 000 000 observations) it might be enough to have smaller development and test data sets (e.g. 98%/1%/1%) or smaller training data sets (depending on the runtime and learning curve of the algorithm).
The statistical field of sample size planning can be used to estimate how much data is needed (to evaluate the metrics properly on the development and test data sets) → Guideline: Use enough data that the performance can be estimated well enough on the development and test data; the result should not be good or bad merely by chance.
If only training and development data are used, there is a danger of overfitting to the development data, so that the results do not generalize.
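As a rough illustration of the sample size guideline, the following sketch uses the textbook normal approximation for an estimated proportion. It assumes the metric is an accuracy (a share of correct predictions) and that the observations are independent; neither assumption comes from the example above, and for other metrics such as the mean squared error the variance would have to be estimated differently. The function names are hypothetical.

```python
import math


def accuracy_standard_error(accuracy: float, n: int) -> float:
    """Standard error of an accuracy estimated on n independent observations."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def required_dev_size(expected_accuracy: float, margin: float,
                      z: float = 1.96) -> int:
    """Observations needed so that the approximate 95% confidence interval
    (for z = 1.96) is no wider than +/- margin around the estimate."""
    p = expected_accuracy
    return math.ceil(z ** 2 * p * (1.0 - p) / margin ** 2)


# Assumed numbers, for illustration only
print(required_dev_size(expected_accuracy=0.9, margin=0.01))
print(accuracy_standard_error(accuracy=0.9, n=200))
```

With an expected accuracy of 0.9 and a margin of 0.01, this comes to roughly 3 500 observations, which is far less than 20% of a dataset with 1 000 000 observations. This is one way to see why a split like 98%/1%/1% can be sufficient for large datasets.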

In the next blog post, I will write more about the possibilities for improving an algorithm once it has been trained on the training data, and how this can be done in an iterative process.

This blog post is partly based on a deep learning course on coursera.org that I took recently. Hence, a lot of credit for this post goes to Andrew Ng, who taught this course.

Feel free to leave a comment below and share your experiences and opinions about this topic. How do you tackle machine learning problems strategically?

Written on August 5, 2021