DL Notes: Advanced Gradient Descent

by Luis Medina, December 2023

The other optimization algorithms shown in the animation above are adaptive methods, which I’ll describe in this section.

Looking at this simple example with a local and a global minimum, Momentum and NAG may seem far superior to the other methods. However, adaptive algorithms are more robust. I’ll show this with a practical example in another article.

Adaptive Gradient Algorithm (AdaGrad)

AdaGrad is a family of subgradient algorithms for stochastic optimization, presented by John Duchi, Elad Hazan and Yoram Singer in 2011.

They proposed to improve gradient-based learning by incorporating the history of the gradients into each new update of the weights.

Instead of biasing the gradient itself, as momentum does, AdaGrad modifies the learning rate dynamically and separately for each parameter of the objective function.

This means we have different learning rates for each model weight. They are adjusted based on the consistency of the gradients.

To do this, the sequence of gradient estimates is stored as follows:

Sum of squared gradients or outer product of gradient history.

If we are optimizing a function with n coordinates or parameters, g will be a vector with n elements, and so will G.

Then, the update rule is given by:

AdaGrad update.

The parameter ε is used to avoid a division by zero and is usually set to a small value, like 1e-08.
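
In standard notation (written per parameter i, to mirror the description above and the code further below), the accumulation and the update can be written as:

G_{t,i} = \sum_{\tau=1}^{t} g_{\tau,i}^{2}, \qquad \theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i}} + \varepsilon}\, g_{t,i}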

Interestingly, the definition of G is similar to the un-centered (zero-mean) variance of the distribution of gradients.

Variance definition.
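
The un-centered version drops the mean term, leaving only the second moment, which is what G accumulates (up to the missing 1/t normalization):

\mathrm{Var}(g) = \mathbb{E}[g^{2}] - (\mathbb{E}[g])^{2} \;\;\longrightarrow\;\; \mathbb{E}[g^{2}] \approx \frac{1}{t}\sum_{\tau=1}^{t} g_{\tau}^{2}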

The variance is a measure of the dispersion energy of a distribution.

Hence, for each parameter θᵢ, the learning rate is adapted in proportion to the inverse of the gradient’s variance for θᵢ.

Considering this, we could say that parameters with more dispersion in their gradient distribution will scale down the learning rate by a larger factor, while those with more consistent gradients (lower variance) will have larger learning rates.

AdaGrad also implements learning rate decay automatically, based on the time (accumulation of previous gradients) and the curvature of the objective function (“areas” with lower gradient variance will be assigned smaller step sizes). This improves the convergence rate of the algorithm.

I’ve implemented AdaGrad as a Python function as follows:

def Adagrad(x_init, y_init, step_size, n_iters):
    eta = step_size
    G_t = 0
    eps = 1e-8

    theta = np.tile([x_init, y_init], (n_iters, 1))
    z = np.tile([f(theta[0])], n_iters)

    for k in range(1, n_iters):
        # Compute gradient
        g_t = grad_f(theta[k-1])

        # Accumulate squared gradients
        G_t += g_t**2

        # Update position
        theta[k] = theta[k-1] - eta * g_t / (np.sqrt(G_t) + eps)
        z[k] = f(theta[k])

    # Store position coordinates
    dataSet = np.stack((theta[:, 0], theta[:, 1], z), 1)

    return dataSet
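
The function assumes np (NumPy), f, and grad_f are already defined, as earlier in the article. As a minimal usage sketch, here is a hypothetical quadratic bowl standing in for the article's test surface (illustrative only):

import numpy as np

# Hypothetical stand-in for the article's cost surface: f(x, y) = x^2 + 3y^2
def f(theta):
    x, y = theta
    return x**2 + 3 * y**2

def grad_f(theta):
    x, y = theta
    return np.array([2 * x, 6 * y])

trajectory = Adagrad(x_init=-2.0, y_init=2.0, step_size=0.5, n_iters=100)
print(trajectory[-1])  # last (x, y, f(x, y)) row; should head toward the minimum at (0, 0)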

One drawback of AdaGrad is that this learning rate decay during training may be too aggressive, causing the learning to stop early when training ANNs. Each parameter update is robust, but the rate at which the changes move toward the optimal point can decrease too much.

Another drawback is that, although the learning rates are self-adjusted during learning, AdaGrad can still be sensitive to the initial conditions. If the gradients are large at the start of the optimization, the learning rates will be low for the rest of the training.

We can see this in the animated figure: AdaGrad breaks away from the symmetry quickly, but the learning is very slow compared to the other algorithms.

To compensate for this, one may need to tune the learning rate to a higher value, which in part defeats the purpose of the self-adjusting characteristic.

Root Mean Square Propagation (RMSprop)

RMSProp is an unpublished method, described in the slides for Lecture 6 of Prof. Geoffrey Hinton's course Neural Networks for Machine Learning.

The concept behind this algorithm is similar to momentum: it also incorporates the short-term history of the magnitude of the gradients to perform the weight updates.

However, similarly to AdaGrad, RMSProp modifies the learning rate and not the gradient.

To do this, the learning rate is divided by a running average of the magnitudes of recent gradients.

First, the algorithm computes an exponentially weighted sum of the current squared gradient and the previous ones:

Exponentially weighted sum of squared gradients.

This is like a short-term mean, where the parameter β adjusts how much weight is given to more recent gradient values over older ones.

It is analogous to the re-written form of momentum that I mentioned before, but applied to the squared gradients instead of the gradient itself.

The next step is dividing the learning rate by the square root of this moving average.

RMSProp update rule.

This way, the step size depends on the history of the gradient magnitudes (short-term memory).
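
Putting the two steps together (E[g²] corresponds to MSQ in the code below):

E[g^{2}]_t = \beta\, E[g^{2}]_{t-1} + (1-\beta)\, g_t^{2}, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^{2}]_t} + \varepsilon}\, g_t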

Notice that computing the root of a weighted sum (or weighted average) of squared values is equivalent to computing the Root Mean Square (RMS) of those values.

Definition of RMS.
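
Written out:

\mathrm{RMS}(x) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^{2}}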

The RMS of a signal is a representation of its total energy (as opposed to the variance, which represents its dispersion energy)[1].

Therefore, with RMSProp, the learning rate is modulated according to the total energy of the gradient of the cost function and its previous values. This adjustment is done dynamically and for each direction or component of our loss function (each weight!).

The goal is to reduce the volatility caused by large changes in the gradient by reducing the step size in those cases.

This also helps with vanishing gradient problems because when there’s a trend of very small gradients we take larger steps.

This is how I coded it as a Python function:

def RMSProp(x_init, y_init, step_size, n_iters, decay):
    beta = decay  # 0.8, 0.9, ..., 0.99
    eta = step_size
    eps = 1e-8
    MSQ = 0

    theta = np.tile([x_init, y_init], (n_iters, 1))
    z = np.tile([f(theta[0])], n_iters)

    for k in range(1, n_iters):
        # Compute gradient
        g_t = grad_f(theta[k-1])

        # Compute the weighted mean of squared values
        MSQ = beta * MSQ + (1 - beta) * g_t**2

        # Update position (divide eta by RMS)
        theta[k] = theta[k-1] - eta * g_t / (np.sqrt(MSQ) + eps)
        z[k] = f(theta[k])

    # Store position coordinates
    dataSet = np.stack((theta[:, 0], theta[:, 1], z), 1)

    return dataSet

RMSprop is robust to the initial choice of learning rate, and it also implements an automatic learning rate decay. However, since it is based on a short-term history of gradient values, the decay is less aggressive than AdaGrad's.
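
To see why, consider a constant gradient of magnitude 1 (a hypothetical, simplified setting): AdaGrad's accumulator grows without bound, so its effective step keeps shrinking, while RMSProp's moving average saturates and its effective step levels off. A quick sketch:

import numpy as np

g = 1.0                     # constant gradient magnitude (hypothetical)
G_t, MSQ, beta = 0.0, 0.0, 0.9

for t in range(100):
    G_t += g**2                            # AdaGrad: unbounded accumulation
    MSQ = beta * MSQ + (1 - beta) * g**2   # RMSProp: saturates near g**2

print(1 / np.sqrt(G_t))   # ~0.1  -> AdaGrad's scaling keeps shrinking with t
print(1 / np.sqrt(MSQ))   # ~1.0  -> RMSProp's scaling levels off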


Adaptive Learning Rate Method (AdaDelta)

AdaDelta was proposed by Matthew Zeiler in 2012.

This method was developed to overcome the main limitations of AdaGrad: the continuous decay of the learning rate that causes early stopping and the need to tune the “global” learning rate manually.

To overcome the continuous learning rate decay, the algorithm accumulates the history of past gradients over a window of fixed size.

In practice, this involves dividing the learning rate by the RMS of previous gradients over a window of fixed size, just like RMSprop does:

Learning rate scaling similar to RMSProp.

The next modification from AdaGrad is the correction of the units of the optimization updates.

In AdaGrad (and all other optimization algorithms I’ve described so far), the units of the optimization steps don’t match the units of the parameters that we modify to optimize the cost function [9]:

We know from school that we can’t add apples and oranges. But with these optimization algorithms, it’s like we’ve been adding “apples” (the current parameter values, θₜ) and some unknown quantity (the optimization step, Δθₜ) that mathematically can be added to them to obtain new apples (the updated parameters, θₜ₊₁). It just works, but it doesn’t make sense in real life.

Zeiler decided to correct the units by rearranging the update term from Newton’s method, assuming the curvature of the loss function could be approximated by a diagonal Hessian matrix.

Comparing this observation with the RMSProp-like update rule above, Zeiler determined the form the update term should take to preserve the correct units.

The intuition is better explained in the original publication, but in practice it results in placing the square root of an exponentially weighted average of the previous squared updates in the numerator of the update term:

AdaDelta step for parameter update.

This is basically assuming that the loss function is smooth (low curvature) within a small window of size w, so that Δθₜ can be approximated by the exponential RMS of the previous values.
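
In compact form (matching the implementation below, where E_gsq and E_xsq play the role of E[g²] and E[Δθ²]):

\Delta\theta_t = -\frac{\mathrm{RMS}[\Delta\theta]_{t-1}}{\mathrm{RMS}[g]_{t}}\, g_t, \qquad \theta_{t+1} = \theta_t + \Delta\theta_t, \qquad \mathrm{RMS}[x]_t = \sqrt{E[x^{2}]_t + \varepsilon}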

The algorithm looks like this if we implement it as a Python function:

def AdaDelta(x_init, y_init, step_size, n_iters, decay):
    # Note: AdaDelta needs no global learning rate; step_size is kept
    # only so the signature matches the other optimizer functions.
    eps = 1e-8
    E_gsq = 0  # Running average of squared gradients, E[g^2]
    E_xsq = 0  # Running average of squared updates, E[delta_theta^2]

    theta = np.tile([x_init, y_init], (n_iters, 1))
    z = np.tile([f(theta[0])], n_iters)

    for k in range(1, n_iters):
        # Compute gradient
        g_t = grad_f(theta[k-1])

        # Accumulate squared gradients and compute the update step
        E_gsq = decay * E_gsq + (1 - decay) * g_t**2
        delta = -np.sqrt(E_xsq + eps) / np.sqrt(E_gsq + eps) * g_t

        # Accumulate squared updates and move to the new position
        E_xsq = decay * E_xsq + (1 - decay) * delta**2
        theta[k] = theta[k-1] + delta
        z[k] = f(theta[k])

    # Setting up data set for animation: combining our position coordinates
    dataSet = np.stack((theta[:, 0], theta[:, 1], z), 1)

    return dataSet

AdaDelta combines the advantages of the optimization methods it builds upon.

For instance, the short-term memory of previous parameter updates in the numerator is similar to Momentum and has the effect of accelerating the gradient descent.

The denominator provides the per-dimension accuracy of AdaGrad but without the excessive learning rate decay (just like RMSProp).

Additionally, AdaDelta is more robust to sudden gradient changes, and it is robust to the choice of initial learning rate (see a practical example in the last section of this article).

Adaptive Moment Estimation (Adam)

This is one of the most popular algorithms today.

It was introduced by Diederik P. Kingma and Jimmy Lei Ba in 2014, and it has become widely adopted because of its computational efficiency and because it works very well for problems with large amounts of data and parameters.

Adam is like a combination of Momentum and RMSprop because it dynamically adapts both the gradient of the loss function and the learning rates used to scale that gradient when updating the weights.

To do this, the algorithm includes the computation of two terms that will be familiar from previous sections of this article.

First, there’s a term from momentum, an exponentially weighted average of the previous gradients of the cost function (this is like a short-term mean of the gradients):

Exponentially weighted average of cost gradients.

Then, there’s a term from RMSprop, an exponentially weighted moving average of the squared gradients.

Exponentially weighted average of squared cost gradients.

Combining both terms with the SGD algorithm, the information from past gradients is included in the update step: their total energy over a short window (RMS) is used to scale the learning rate, while their short-term average (the momentum-like term) replaces the raw gradient used for updating the weights.

Adam’s update rule.

The values with tilde (~) correspond to bias correction terms that are introduced to reduce the contribution from the initial values of m and v as learning progresses:

Initialization bias correction terms for Adam.

t = current iteration (time step).
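
Putting all the pieces together (the tilde terms are the bias-corrected versions of m and v; this mirrors the implementation below):

m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}

\tilde{m}_t = \frac{m_t}{1-\beta_1^{t}}, \qquad \tilde{v}_t = \frac{v_t}{1-\beta_2^{t}}, \qquad \theta_{t+1} = \theta_t - \frac{\eta\, \tilde{m}_t}{\sqrt{\tilde{v}_t} + \varepsilon}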

Unlike AdaDelta, Adam does require tuning of some hyperparameters, but they are easy to interpret.

The terms β₁ and β₂ are the decay rates of the exponential moving averages of the gradients and the squared gradients, respectively.

Larger values assign more weight to previous gradients, giving smoother behavior that is less reactive to recent changes. Values closer to zero give more weight to recent changes in the gradients. Typical values are β₁ = 0.9 and β₂ = 0.999.

ε is, like in all previous cases, a constant added to avoid division by zero, usually set to 1e-8.

Despite having a bunch of additional terms and offering significant advantages, Adam is easy to implement:

def Adam(x_init, y_init, step_size, n_iters,
         beta_1=0.9, beta_2=0.999):

    eps = 1e-8
    eta = step_size

    # Initialize vectors
    m_t = np.array([0, 0])
    v_t = np.array([0, 0])
    theta = np.tile([x_init, y_init], (n_iters, 1))
    z = np.tile([f(theta[0])], n_iters)

    for k in range(1, n_iters):
        # Compute gradient
        g_t = grad_f(theta[k-1])

        # Compute "momentum-like" term (weighted average)
        m_t = beta_1 * m_t + (1 - beta_1) * g_t

        # Compute the mean of squared gradient values
        v_t = beta_2 * v_t + (1 - beta_2) * g_t**2

        # Initialization bias correction terms
        m_t_hat = m_t / (1 - beta_1**k)
        v_t_hat = v_t / (1 - beta_2**k)

        # Update position with adjusted gradient and learning rate
        theta[k] = theta[k-1] - eta * m_t_hat / (np.sqrt(v_t_hat) + eps)
        z[k] = f(theta[k])

    # Store position coordinates
    dataSet = np.stack((theta[:, 0], theta[:, 1], z), 1)

    return dataSet
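
Reusing the hypothetical quadratic surface from the AdaGrad usage sketch above, a quick sanity check could look like this (illustrative only, not the article's experiment):

# Assumes f, grad_f, and the optimizer functions above are in scope
traj_adam = Adam(x_init=-2.0, y_init=2.0, step_size=0.1, n_iters=200)
traj_rms = RMSProp(x_init=-2.0, y_init=2.0, step_size=0.1, n_iters=200, decay=0.9)

print(traj_adam[-1])  # last (x, y, f(x, y)) rows; both should approach the minimum at (0, 0)
print(traj_rms[-1])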

Interestingly, the authors of the paper point out that the term

Adam’s learning rate scaling.

resembles the definition of the Signal-to-Noise Ratio (SNR):

Signal-to-Noise ratio.
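
One common way of writing it (treating the bias-corrected mean as the signal and the square root of the bias-corrected second moment as the noise level):

\frac{\tilde{m}_t}{\sqrt{\tilde{v}_t}} \;\sim\; \mathrm{SNR} = \frac{\mu_{\mathrm{signal}}}{\sigma_{\mathrm{noise}}}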

Then, we could say that for smaller SNR values, the parameter updates will be close to zero. This means that we won’t perform large updates whenever there’s too much uncertainty about whether we are moving in the direction of the true gradient.

Adam and its variants typically outperform the other algorithms when training DL models, especially when there are very noisy and sparse gradients.

I decided to compare how the different optimizers perform when initialized with different “global” learning rates.

This is a rather simple example, but it gives an idea of how these methods are affected by the choice of learning rate.

Comparing the evolution of x and y coordinates during optimization with different algorithms. For Momentum and NAG, mu = 0.95. For RMSProp and AdaDelta, decay parameter = 0.9.

AdaDelta seems remarkably robust to the global learning rate setting, and it “descends” at the same rate for all three cases. We can also see how AdaGrad requires larger learning rates to achieve a performance comparable to AdaDelta in this case.

For small learning rates, it is clear that Adam and RMSProp are similar, and superior to Momentum and SGD.

However, for larger learning rates, RMSProp shows consistent oscillations around the optimum x value (x = 0), while Adam stabilizes after the initial transient, thanks to the dampening effect of the momentum term in the numerator.

The adaptive algorithms break off the symmetry earlier than SGD and Momentum methods, except for the case with a global learning rate of 0.1 for which Momentum and NAG outperform AdaDelta.

Again, these observations only apply to this particular scenario.
