Why You Should Never Use Cross-Validation | by Samuele Mazzanti | Mar, 2024

As a data scientist, I frequently need a quick-and-dirty estimate of how a predictive model would perform on a given dataset. For a long time, I did this through cross-validation. Then I realized I was completely off track. Indeed,

With real-world problems, cross-validation is completely untrustworthy.

Since I could bet that many data scientists still rely on this technique, I think it’s very relevant to take a deep dive into this topic.

In this article — with the help of a toy example and a real dataset — I will go through the reasons why cross-validation is never a good choice when dealing with real-world problems.

Cross-validation is a model validation technique used to obtain an estimate of how a model trained on a dataset will perform on a new (unseen) set of data.

Note: there are many types of cross-validation. In this article, for simplicity, when we say “cross-validation” we refer to random K-fold cross-validation, which is by far the most common type of cross-validation.

So how does cross-validation work in practice?

Suppose we have a dataset ready for a machine learning model, i.e. a matrix of predictors (that we call X) and a vector containing the target variable (that we call y).

Cross-validation consists of assigning each row of the dataset to one of K folds (typically K=5). The assignment is done completely at random, while ensuring that all folds have approximately the same number of rows.

For instance, let’s take a dataset consisting of 10 rows and K=5.

Random K-fold cross-validation on a dataset (K=5). [Image by Author]
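The random assignment above can be sketched in a few lines. This is a minimal illustration using NumPy; the article doesn't prescribe a library, so the implementation details here are one possible choice.

```python
import numpy as np

n_rows, K = 10, 5  # the toy example: 10 rows, K=5 folds
rng = np.random.default_rng(42)

# Give each fold the same number of rows, then shuffle the assignment
# so that each row lands in a fold completely at random.
fold_of_row = rng.permutation(np.repeat(np.arange(K), n_rows // K))
print(fold_of_row)  # each of the 5 folds gets exactly 2 of the 10 rows
```

In practice you would use a ready-made splitter such as scikit-learn's `KFold(shuffle=True)`, which handles datasets whose size is not a multiple of K.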

The cross-validation process then trains a different model on each combination of K-1 folds and measures its performance on the remaining (held-out) fold.
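The full loop can be sketched as follows. This is a hedged example on synthetic data: the model (logistic regression) and the metric (accuracy) are illustrative choices, not something the article mandates.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy dataset: 100 rows, 3 predictors, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + rng.normal(size=100) > 0).astype(int)

scores = []
# Each iteration trains on K-1 folds and evaluates on the remaining fold.
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print(np.mean(scores))  # the cross-validated performance estimate
```

The mean of the K held-out scores is what is usually reported as "the" cross-validation estimate of model performance.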
