In a recent NimbleBox article, you read about the performance metrics that experts and professionals in the field use to evaluate their models. There is another tool that can strengthen your model training arsenal: Cross-Validation, a technique used in machine learning to evaluate, tune, and test predictive models.
Over years of practice and experimentation, several methods have been established for getting machine learning models into their best operating state, from hyperparameter tuning to making the most of the available data. Cross-Validation is one such method.
Let us explore the technique and where it fits in your MLOps pipeline.
Cross-Validation is a machine learning technique that helps compare and choose between machine learning models and architectures. It is best thought of as an evaluation procedure: the model is trained and scored repeatedly on different splits of the data to estimate how well it will perform on data it has not seen.
The technique rests on the idea that, like humans, machines need to generalize in order to stay flexible when performing tasks such as classification. Cross-Validation therefore checks how a model performs against various selections of the data, so that it does not become rigidly tied to one set of data.
Much like a child, a model that is only ever shown the same kind of data may become too accustomed to it and lose its ability to adapt to new data. This dilemma brings up the concept of Over-Fitting, so let us look into over-fitting before we dive into the intricacies of Cross-Validation.
Over-fitting is when a model becomes so closely fitted to its training data set that it cannot perform well on any other data. When this happens, the model is not ready to tackle real-life situations where it will be exposed to unknown data, which defeats the purpose of the model and, in turn, of your entire MLOps pipeline.
Cross-Validation is a resampling technique in which the available data is split into several subsets that take turns serving as training and testing data. The model is repeatedly trained on some subsets and evaluated on the rest, and the results are combined to estimate its performance.
Let us look at some of the variations of Cross-Validation used in the industry, with their pros and cons.
Hold-Out is the classic and most straightforward cross-validation technique. It works on the simple principle of randomly splitting the dataset into two parts, a training set and a hold-out set. We then train the model on the training set and evaluate it on the hold-out set.
Depending on the size of the data, the split is around 60:40 for train and test, respectively. This method works well when the model is relatively simple to train and we do not need to experiment much with the hyperparameters.
One major disadvantage of this method is that the error calculated on the hold-out set depends on which data points happened to be selected. Even though the selection is random, a single split may not reflect how the model performs across the whole dataset.
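The hold-out split described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `hold_out_split`, its `test_ratio` and `seed` parameters, and the toy data are all assumptions made for the example.

```python
import random

def hold_out_split(data, test_ratio=0.4, seed=0):
    """Randomly split `data` into (train, hold_out) lists.

    `test_ratio` and `seed` are illustrative parameters, not a standard API.
    """
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be strictly between 0 and 1")
    rng = random.Random(seed)   # a fixed seed keeps the split reproducible
    shuffled = list(data)       # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

# Toy data: ten sample identifiers split 60:40.
train, hold_out = hold_out_split(range(10), test_ratio=0.4)
print(len(train), len(hold_out))

# A different seed yields a different hold-out set, which is why a score
# measured on a single split depends on the luck of the draw.
_, other_hold_out = hold_out_split(range(10), test_ratio=0.4, seed=1)
print(sorted(hold_out), sorted(other_hold_out))
```

Running the split with several seeds and comparing the resulting scores is a quick way to see how much a hold-out estimate can move from one split to the next.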
K-Fold, one of the most well-known cross-validation techniques, has been the general go-to for many models and engineers. This resampling technique divides the data into "k" sections of equal size, called folds. Each fold takes a turn as the testing set while the others are used for training.
How do these folds work? First, one fold is set aside for testing, and the remaining k-1 folds are used for training. The process is repeated k times so that every fold serves as the testing set once, and the accuracy achieved in each iteration is stored in an array. These accuracies are then aggregated, usually by averaging.
One of the most important advantages of this cross-validation method is that every data point is used for training in some iterations and for testing in exactly one, so the aggregate score is less likely to be skewed by outliers that happen to land in a single split.
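To make the fold rotation concrete, here is a minimal sketch in plain Python. The helper names `k_fold_indices` and `cross_validate`, the toy `labels`, and the majority-class "model" are all hypothetical and stand in for a real training routine. The sketch also does not shuffle the data first, which a real pipeline would usually do.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread any remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def cross_validate(n_samples, k, score_fn):
    """Score every fold with `score_fn(train, test)` and average the results."""
    scores = [score_fn(train, test) for train, test in k_fold_indices(n_samples, k)]
    return scores, sum(scores) / len(scores)

# Toy labels, invented for this example.
labels = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]

def majority_accuracy(train, test):
    # Toy "model": predict the most common training label for every test point.
    train_labels = [labels[i] for i in train]
    prediction = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == prediction for i in test) / len(test)

scores, mean_score = cross_validate(len(labels), k=5, score_fn=majority_accuracy)
print(scores, mean_score)
```

In practice `score_fn` would fit a real model on the training indices and return its accuracy on the test indices; the rotation logic stays the same.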
Leave One Out Cross-Validation, or LOOCV, is another frequently used cross-validation technique that stems from K-Fold Cross-Validation. A more extreme version of K-Fold, LOOCV works on the principle of setting the "k" in K-Fold to "n," the total number of data points in the dataset.
Each of these n points is then taken in turn as the testing data, with the remaining (n-1) points used for training. This may seem excessive compared with K-Fold, but using every point for testing while withholding almost nothing from training works in the method's favor. However, the technique is far more expensive than K-Fold, because the model must be trained n times to produce the aggregate accuracy.
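A small sketch shows why the cost grows with the dataset: the number of splits, and therefore the number of training runs, equals the number of data points. The function name `leave_one_out_indices` is an assumption made for this illustration.

```python
def leave_one_out_indices(n_samples):
    """Yield (train_indices, test_indices) with exactly one test point per split."""
    for i in range(n_samples):
        train = [j for j in range(n_samples) if j != i]
        yield train, [i]

splits = list(leave_one_out_indices(5))

# One split, and so one model fit, per data point.
print(len(splits))
for train, test in splits:
    print("train:", train, "test:", test)
```

For a dataset with thousands of rows this means thousands of training runs, which is why LOOCV tends to be reserved for small datasets or models that are cheap to train.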
One case where traditional cross-validation does not work, and for good reason, is Time Series data. First, let's see what Time Series data is. Time Series data is a series of data points indexed in the order in which they were collected over time. These points depend on one another, so they cannot be shuffled randomly the way traditional cross-validation techniques require.
Rolling cross-validation handles this by choosing an initial subset for training and using the data points that immediately follow it to check the forecast. The points that were just forecasted are then added to the training set, and the process is repeated on the next points in the series.
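The rolling procedure can be sketched as an expanding training window that never looks into the future. The function name `rolling_splits` and its `initial_train_size` and `horizon` parameters are hypothetical, chosen only for this example.

```python
def rolling_splits(n_samples, initial_train_size, horizon=1):
    """Yield (train_indices, test_indices) where training always precedes testing.

    `horizon` is the number of points forecast in each step (an assumption
    of this sketch; the article does not fix a value).
    """
    if initial_train_size < 1 or horizon < 1:
        raise ValueError("initial_train_size and horizon must be at least 1")
    train_end = initial_train_size
    while train_end + horizon <= n_samples:
        train = list(range(0, train_end))
        test = list(range(train_end, train_end + horizon))
        yield train, test
        train_end += horizon  # absorb the points just forecast into training

splits = list(rolling_splits(n_samples=8, initial_train_size=4, horizon=2))
for train, test in splits:
    print("train:", train, "test:", test)
```

Because indices are never shuffled, every test point comes strictly after every training point, which preserves the time ordering that ordinary K-Fold would break.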
Now that we have looked at the definition and the various types of cross-validation, you may ask: what are the main advantages that make cross-validation a necessary piece of your machine learning arsenal?
Let us look at some instances where Cross-Validation becomes necessary.
When data is scarce, which happens quite often when working with startups, we lose a lot of training potential by setting aside part of the data purely for testing.
Cross-Validation helps us utilize every data point in our dataset, especially with techniques like K-Fold and Leave One Out Cross-Validation, and train the model to its full potential.
When creating a regular test and train split, we work on the assumption that the individual data points are not related to one another. As a result, any relationships among the data points are ignored when building and evaluating the final model.
In cross-validation, especially with the rolling technique used for Time Series data, the model sees the points in their original order, which preserves the relationships between them.
When dealing with a single performance metric on a single model, we are left with only one value that may or may not reflect the true performance of the model. The advantage cross-validation holds over a single metric is that it provides metrics from multiple models, each trained on a different combination of training and testing data.
When we train the model as often as we do in K-Fold or Leave One Out, we get a set of scores rather than a single number. This shows how stable the model's performance is across splits and helps create a strong game plan for the model training.
Cross-Validation also lets us check whether a result obtained from training the model only once is genuine, or whether the numbers came about by chance due to some bias in the data split.
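One simple way to use the per-fold scores is to look at their spread as well as their mean. The sketch below uses Python's standard `statistics` module; the helper name `summarize_scores` is hypothetical and the fold scores are made-up numbers for illustration, not results from a real model.

```python
import statistics

def summarize_scores(scores):
    """Return (mean, sample standard deviation, min, max) of per-fold scores."""
    mean = statistics.mean(scores)
    # stdev needs at least two values; fall back to 0.0 for a single score.
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, spread, min(scores), max(scores)

# Made-up accuracies from five hypothetical folds.
fold_scores = [0.82, 0.79, 0.85, 0.80, 0.84]

mean, spread, lowest, highest = summarize_scores(fold_scores)
print(f"mean={mean:.3f} spread={spread:.3f} range=({lowest}, {highest})")
```

A small spread suggests the result does not hinge on one lucky split, while a large spread is a hint that a single train and test score should not be trusted on its own.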
One area where Cross-Validation shines is Medical Research, where large numbers of data points are often unavailable. Cross-Validation helps utilize every data point in the final aggregation of the performance.
In this article, we went through Cross-Validation, one of the most powerful tools in your Machine Learning arsenal. K-Fold is usually the default choice because it balances processing cost against the reliability of its results, and cross-validation in general gives you better insight into the full potential of your dataset.
We hope the article helps you figure out the best Cross-Validation technique for your needs and smooth out that MLOps pipeline.