In one of my previous posts, we explained how neural networks learn through the backpropagation algorithm. The main idea is that we start at the output layer and move, or “propagate”, the error all the way back to the input layer, updating the weights with respect to the loss function as we go. If you are unfamiliar with this, I highly recommend you check out that post:
The weights are updated using the partial derivative of the loss function with respect to each weight. The problem is that, because the chain rule multiplies many of these terms together, the gradients get smaller and smaller as we approach the lower layers of the network. This leads to the lower layers’ weights barely changing when training the network. This is known as the vanishing gradient problem.
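To make this concrete, here is a minimal sketch (my own toy example, not from the original post) of a five-layer network with sigmoid activations in PyTorch. Printing the gradient norm of each layer’s weights after a single backward pass shows the gradients shrinking as we move from the output layer back towards the input layer:

```python
# Toy demonstration of vanishing gradients with sigmoid activations.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stack of linear layers with sigmoid activations; sizes are arbitrary.
layers = nn.ModuleList([nn.Linear(64, 64) for _ in range(5)])

x = torch.randn(32, 64)          # dummy batch of inputs
h = x
for layer in layers:
    h = torch.sigmoid(layer(h))  # sigmoid after every layer

loss = h.mean()                  # dummy scalar loss
loss.backward()

# Gradient norms shrink as we move from the last layer (index 4)
# back towards the first layer (index 0).
for i, layer in enumerate(layers):
    print(f"layer {i}: grad norm = {layer.weight.grad.norm():.6f}")
```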
The opposite can also happen, where the gradients keep growing as they are propagated back through the layers. This is the exploding gradient problem, which is mainly an issue in recurrent neural networks.
However, a 2010 paper by Xavier Glorot and Yoshua Bengio, “Understanding the difficulty of training deep feedforward neural networks”, diagnosed several reasons why this happens to the gradients. The main culprits were the sigmoid activation function and the way the weights were initialised (typically by sampling from the standard normal distribution). With this combination, the variance of each layer’s outputs is larger than the variance of its inputs, so moving forward through the network the activations keep growing until they saturate at the extreme edges of the sigmoid function.
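As a rough illustration of that effect (my own sketch, not a reproduction of the paper’s experiments), the snippet below pushes random data through a few layers whose weights are drawn from the standard normal distribution. Most of the sigmoid outputs end up squashed against 0 or 1, exactly where the gradient is flat:

```python
# Illustration of sigmoid saturation under standard-normal weight initialisation.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_units = 100
h = rng.standard_normal((1000, n_units))  # dummy batch of inputs

for layer in range(1, 6):
    W = rng.standard_normal((n_units, n_units))  # weights ~ N(0, 1)
    h = sigmoid(h @ W)
    # Fraction of activations stuck near 0 or 1, where the gradient vanishes.
    saturated = np.mean((h < 0.05) | (h > 0.95))
    print(f"layer {layer}: {saturated:.0%} of activations saturated")
```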
Below is the mathematical equation and plot of the sigmoid function. Notice that at its extremes, the gradient becomes zero, so no “learning” is done at these saturation points.
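For reference, the sigmoid function and its derivative (the quantity that backpropagation actually multiplies into the gradient) can be written as:

```latex
\sigma(x) = \frac{1}{1 + e^{-x}},
\qquad
\sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)
```

Since σ(x) approaches 0 or 1 as x moves far from the origin, σ′(x) approaches 0 in both tails, and it never exceeds 0.25 (its value at x = 0), which is why a deep stack of sigmoid layers shrinks the gradient so quickly.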
We will now go through some techniques that can reduce the chance of our gradients vanishing or exploding during training.