In one of my previous posts, we explained how neural networks learn through the backpropagation algorithm. The main idea is that we start at the output layer and move, or “propagate”, the error all the way back to the input layer, updating the weights with respect to the loss function as we go. If you are unfamiliar with this, I highly recommend you check out that post:
The weights are updated using the partial derivative of the loss function with respect to each weight. The problem is that, because the chain rule multiplies many of these terms together, the gradients get smaller and smaller as we approach the lower layers of the network. This leads to the lower layers’ weights barely changing when training the network. This is known as the vanishing gradient problem.
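To make this concrete, here is a minimal sketch (my own toy example, not from the original post) of a five-layer network with sigmoid activations in PyTorch. Printing the gradient norm of each layer’s weights after a single backward pass shows the gradients shrinking as we move from the output layer back towards the input layer:

```python
# Toy demonstration of vanishing gradients with sigmoid activations.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stack of linear layers with sigmoid activations; sizes are arbitrary.
layers = nn.ModuleList([nn.Linear(64, 64) for _ in range(5)])

x = torch.randn(32, 64)          # dummy batch of inputs
h = x
for layer in layers:
    h = torch.sigmoid(layer(h))  # sigmoid after every layer

loss = h.mean()                  # dummy scalar loss
loss.backward()

# Gradient norms shrink as we move from the last layer (index 4)
# back towards the first layer (index 0).
for i, layer in enumerate(layers):
    print(f"layer {i}: grad norm = {layer.weight.grad.norm():.6f}")
```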
The opposite can also happen, where the gradients keep growing as they are propagated back through the layers. This is the exploding gradient problem, which is mainly an issue in recurrent neural networks.
However, a 2010 paper by Xavier Glorot and Yoshua Bengio, “Understanding the difficulty of training deep feedforward neural networks”, diagnosed several reasons why this happens to the gradients. The main culprits were the sigmoid activation function and the way the weights were initialised (typically by sampling from the standard normal distribution). With this combination, the variance of each layer’s outputs is larger than the variance of its inputs, so moving forward through the network the activations keep growing until they saturate at the extreme edges of the sigmoid function.
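As a rough illustration of that effect (my own sketch, not a reproduction of the paper’s experiments), the snippet below pushes random data through a few layers whose weights are drawn from the standard normal distribution. Most of the sigmoid outputs end up squashed against 0 or 1, exactly where the gradient is flat:

```python
# Illustration of sigmoid saturation under standard-normal weight initialisation.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_units = 100
h = rng.standard_normal((1000, n_units))  # dummy batch of inputs

for layer in range(1, 6):
    W = rng.standard_normal((n_units, n_units))  # weights ~ N(0, 1)
    h = sigmoid(h @ W)
    # Fraction of activations stuck near 0 or 1, where the gradient vanishes.
    saturated = np.mean((h < 0.05) | (h > 0.95))
    print(f"layer {layer}: {saturated:.0%} of activations saturated")
```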
Below is the mathematical equation and plot of the sigmoid function. Notice that at its extremes, the gradient becomes zero, so no “learning” is done at these saturation points.
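For reference, the sigmoid function and its derivative (the quantity that backpropagation actually multiplies into the gradient) can be written as:

```latex
\sigma(x) = \frac{1}{1 + e^{-x}},
\qquad
\sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)
```

Since σ(x) approaches 0 or 1 as x moves far from the origin, σ′(x) approaches 0 in both tails, and it never exceeds 0.25 (its value at x = 0), which is why a deep stack of sigmoid layers shrinks the gradient so quickly.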
We will now go through some techniques that can reduce the chance of our gradients vanishing or exploding during training.