Need help deciding which validation technique to use?
There are a number of techniques for testing the performance of newly developed machine learning models. Have you ever wondered which technique you should use for your data science project, and when? The goal of this post is to provide some helpful guidelines.
First things first, how big is your dataset?
If you're working with a very small dataset, consider using leave-one-out cross-validation. In this method, each observation is held out from the dataset in turn for testing, while the rest of the observations are used to train the model. This is repeated N times, once for each observation in the dataset, and the test error is computed as the average of all N errors.
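Here's a minimal sketch of leave-one-out cross-validation using scikit-learn. The tiny random dataset and the LinearRegression model are just stand-ins for your own data and model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Tiny toy regression dataset (a stand-in for your own small dataset)
rng = np.random.default_rng(42)
X = rng.random((20, 3))
y = rng.random(20)

# Each of the 20 rounds holds out exactly one observation for testing
loo = LeaveOneOut()
scores = cross_val_score(LinearRegression(), X, y,
                         cv=loo, scoring="neg_mean_squared_error")
print("LOOCV test error (MSE):", -scores.mean())
```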
If you have a moderately sized or large dataset, turn your attention to the next set of methods.
Say you're just kicking off a data science project, are in a bit of a time crunch, and just need a quick and dirty model to serve as a starting point. The holdout method is a good go-to technique. It involves splitting the dataset into a "train" set and a "test" set: the training set is used to train the model, and the test set is used to see how well the model performs on unseen data. A common approach is to use 80% of the data for training and 20% for testing.
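A quick sketch of an 80/20 holdout split with scikit-learn's train_test_split, again with toy data standing in for your own:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Toy data standing in for your own features X and target y
rng = np.random.default_rng(0)
X = rng.random((500, 4))
y = rng.random(500)

# 80% of the rows go to training, 20% are held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("Holdout R^2 on unseen data:", model.score(X_test, y_test))
```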
If you have more time on your hands, K-fold cross-validation is preferred over the holdout method because it gives a better indication of how the model will perform on unseen data. In this approach the data is randomly split into K groups. One group is used as the test set and the rest are used for training. This process is repeated until each group has served as the test set, and the test error is computed as the average test error across all groups.
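Here's what that looks like with scikit-learn's KFold and cross_val_score (5 folds is a common, somewhat arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Toy regression data standing in for your own
rng = np.random.default_rng(0)
X = rng.random((500, 4))
y = rng.random(500)

# 5 folds: each fold takes a turn as the test set
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y,
                         cv=kf, scoring="neg_mean_squared_error")
print("Average K-fold test error (MSE):", -scores.mean())
```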
But regular K-fold cross-validation and the holdout method are not ideal for datasets with class imbalance. For cases like those, stratified cross-validation is the way to go. In this approach the train-test splits are chosen so that each set has the same percentage of examples of each target class as the complete dataset.
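A sketch of stratified K-fold cross-validation, using a made-up imbalanced classification dataset from make_classification to illustrate the idea:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced toy dataset: roughly 90% of one class, 10% of the other
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)

# Each fold preserves the ~90/10 class proportions of the full dataset
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=skf, scoring="f1")
print("Average stratified K-fold F1:", scores.mean())
```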
If you choose K-fold cross-validation and you intend to use it both to tune the hyperparameters of your model and to get an error estimate, there is a much better approach, my friend. Nested cross-validation is well suited to this task because it avoids the overfitting that comes from tuning and evaluating on the same splits. This article provides a detailed breakdown of the technique: https://machinelearningmastery.com/nested-cross-validation-for-machine-learning-with-python/.
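In scikit-learn, nested cross-validation can be sketched by wrapping a GridSearchCV (the inner tuning loop) inside cross_val_score (the outer evaluation loop). The SVC model, the C grid, and the fold counts below are illustrative choices, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

# Toy classification data standing in for your own
X, y = make_classification(n_samples=300, random_state=42)

# Inner loop tunes hyperparameters; outer loop estimates generalization error
inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=inner_cv)
scores = cross_val_score(search, X, y, cv=outer_cv)
print("Nested CV accuracy:", scores.mean())
```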
If you're working with time series data, where the goal is to produce a forecast of future events, none of the approaches discussed earlier will work. Time series cross-validation is what you'll need. In this approach a single observation is used as the test set, and the training set is composed only of observations that occurred prior to the test observation. This is repeated with multiple observations used as test points, and the forecast error is computed by averaging the errors across all test points. This approach can be extended to evaluate multi-step forecasts.
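scikit-learn's TimeSeriesSplit implements an expanding-window variant of this idea (each test window can contain more than one observation, but training data always comes from earlier in time). The lag-feature setup below is a hypothetical example, not the only way to frame a forecasting problem:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Hypothetical series: predict the next value from the two previous values
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=120))
X = np.column_stack([series[:-2], series[1:-1]])  # two lag features
y = series[2:]

# Each split trains only on observations that occur before its test window
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(LinearRegression(), X, y,
                         cv=tscv, scoring="neg_mean_absolute_error")
print("Average forecast error (MAE):", -scores.mean())
```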
Well, that's it. Feel free to share your thoughts in the comment section.
Take care