When working on deep learning projects, have you ever noticed that the more layers your neural network has, the slower the training becomes?
If your answer is YES, then congratulations: it’s time to consider using batch normalization.
As the name suggests, batch normalization is a technique that standardizes batches of training data after the activation of the current layer and before they move on to the next layer. Here’s how it works:
- The entire dataset is randomly divided into N batches without replacement, each of size mini_batch, for training.
- For the i-th batch, standardize the data distribution within the batch using the formula: X̂i = (Xi - Xmean) / Xstd.
- Scale and shift the standardized data with γX̂i + β, where γ and β are learnable parameters, so the neural network can undo the effects of standardization if needed (a minimal sketch follows this list).
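To make the two steps concrete, here is a minimal sketch of the batch normalization forward pass in NumPy. The function name `batch_norm`, the `eps` term for numerical stability, and the example shapes are illustrative assumptions, not details from the steps above.

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    """Standardize a mini-batch per feature, then scale and shift it.

    X: array of shape (mini_batch, n_features) from the current layer.
    gamma, beta: learnable scale and shift, one value per feature.
    eps: small constant to avoid division by zero (an assumed detail).
    """
    X_mean = X.mean(axis=0)                      # per-feature mean over the batch
    X_var = X.var(axis=0)                        # per-feature variance over the batch
    X_hat = (X - X_mean) / np.sqrt(X_var + eps)  # step 1: standardize
    return gamma * X_hat + beta                  # step 2: scale and shift

# With gamma=1 and beta=0 the output is plain standardized data:
# per-feature mean ~0 and standard deviation ~1.
X = np.random.default_rng(0).normal(5.0, 3.0, size=(32, 4))
out = batch_norm(X, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))
```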
The steps seem simple, don’t they? So, what are the advantages of batch normalization?
Neural networks typically adjust their parameters using gradient descent. If the cost function is smooth and has a single minimum, the parameters converge quickly along the gradient.
But if the features have vastly different scales across input nodes, the cost function looks less like a round pit and more like a long, narrow valley: the gradient points mostly across the steep walls rather than along the valley floor, making convergence exceptionally slow.
Confused? No worries, let’s explain this situation with a visual:
First, prepare a synthetic dataset with only two features, whose scales are vastly different, along with a target function:
```python
import numpy as np

rng = np.random.default_rng(42)
A = rng.uniform(1, 10, 100)    # feature A: small scale
B = rng.uniform(1, 200, 100)   # feature B: much larger scale
y = 2*A + 3*B + rng.normal(size=100) * 0.1  # target, with a little noise
```
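Before the visual itself, here is one way such a cost surface could be computed and drawn; the mean-squared-error cost, the weight grid ranges, and the matplotlib contour plot are my assumptions for a sketch, not necessarily the exact figure that follows.

```python
import matplotlib.pyplot as plt

# Mean-squared-error cost as a function of the two weights (w1 for A, w2 for B),
# evaluated on a grid around the true values (2, 3).
w1 = np.linspace(-3, 7, 200)
w2 = np.linspace(-2, 8, 200)
W1, W2 = np.meshgrid(w1, w2)

# cost[i, j] = mean over samples of (w1*A + w2*B - y)^2
cost = ((W1[..., None] * A + W2[..., None] * B - y) ** 2).mean(axis=-1)

plt.contour(W1, W2, cost, levels=50)
plt.xlabel("w1 (weight for A)")
plt.ylabel("w2 (weight for B)")
plt.title("Elongated cost valley caused by mismatched feature scales")
plt.show()
```

Because B spans 1 to 200 while A spans only 1 to 10, the cost changes steeply along w2 but barely along w1, which is exactly the long, narrow valley described above.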