Avoid Overfitting in Neural Networks: a Deep Dive | by Riccardo Andreoni

Learn how to implement regularization techniques to boost performance and prevent neural network overfitting

When training a deep neural network, it’s often hard to achieve the same performance on both the training and validation sets. A considerably higher error on the validation set is a clear sign of overfitting: the network has become too specialized to the training data. In this article, I provide a comprehensive guide on how to address this issue.

When dealing with any machine learning application, it’s important to have a clear understanding of the bias and variance of the model. In traditional machine learning, we talk about the bias vs. variance tradeoff: the difficulty of minimizing both the bias and the variance of a model at the same time.

To reduce the bias of a model (i.e. its error from erroneous assumptions), we need a more complex model. Conversely, reducing the model’s variance (its sensitivity to fluctuations in the training data) calls for a simpler model. The bias vs. variance tradeoff in traditional machine learning therefore derives from the conflict of needing both a more complex and a simpler model at the same time.
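To make the tradeoff concrete, here is a minimal sketch that uses polynomial regression as a stand-in for a "traditional" model. The synthetic sine data, the noise level, the chosen degrees, and the helper names `sample` and `fit_and_score` are all assumptions made for illustration, not part of any particular library or of the method discussed later.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def fit_and_score(degree, x_train, y_train, x_val, y_val):
    """Fit a polynomial of the given degree by least squares.

    Returns (training MSE, validation MSE).
    """
    coeffs = P.polyfit(x_train, y_train, degree)
    train_mse = float(np.mean((P.polyval(x_train, coeffs) - y_train) ** 2))
    val_mse = float(np.mean((P.polyval(x_val, coeffs) - y_val) ** 2))
    return train_mse, val_mse

# Hypothetical data: a noisy sine curve (an assumption for this sketch).
rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.2, n)
    return x, y

x_train, y_train = sample(20)   # small training set
x_val, y_val = sample(200)      # larger held-out validation set

# Model complexity is controlled by the polynomial degree.
results = {d: fit_and_score(d, x_train, y_train, x_val, y_val) for d in (1, 4, 15)}

for degree, (train_mse, val_mse) in results.items():
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  val MSE={val_mse:.4f}")
```

In a setup like this, the training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. The validation error typically does not follow: a very low degree tends to underfit (high bias), while a very high degree tends to fit the noise in the 20 training points (high variance).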

In the Deep Learning era, we have tools to reduce just the model’s variance without hurting the model’s bias or, on the contrary, to reduce the bias without increasing the variance.

Before exploring the different techniques used to prevent the overfitting of a neural network, it’s important to clarify what high variance or high bias means.

Consider a common neural network task such as image recognition, and imagine a network that detects the presence of pandas in a picture. We can confidently assume that a human can carry out this task with near 0% error, which makes this a reasonable benchmark for the image recognition network. After training the neural network on the training set and evaluating its performance on both the training and validation sets, we may end up with these different results: