I have several questions regarding batch normalization:

In the lecture, Prof. Andrew mentioned that applying it keeps the input distributions at later layers from changing massively, thus decoupling the later layers from the earlier ones. So:

1- The neural network itself is still learning the distribution at each layer during training, so how does this solve internal covariate shift when I still don't know the distribution that each layer's inputs are mapped to?

2- How does it help generalization when testing on similar data with different characteristics (like the cats example from the lecture), given that I'm keeping the distributions constant across each layer even though the input values within those distributions are changing?

3- How does it speed up training, by analogy with input feature normalization (which prevents the input layer from receiving features with very different ranges), when the distribution itself is still being trained on, unlike the input features, which are normalized once to a constant distribution?

4- Why is the later layers' adaptation to changes in earlier layers reduced, when changes in earlier layers lead to different layer output values, and even different values within the same distribution can lead to different outputs?

Hi @Omar_Aziz

Batch normalization normalizes the inputs of each layer to have a mean of zero and a variance of one. This helps each layer learn on a more consistent distribution during the training phase.
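As a minimal sketch of this normalization step (plain NumPy, all names illustrative):

```python
import numpy as np

def batch_norm_forward(x, eps=1e-5):
    # x: mini-batch of layer inputs, shape (batch_size, features)
    mu = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                     # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)   # now ~zero mean, ~unit variance
    return x_hat

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 4)) * 5.0 + 3.0  # inputs with arbitrary scale/shift
x_hat = batch_norm_forward(x)
# x_hat has per-feature mean ~0 and std ~1 regardless of x's original scale
```

Whatever scale and shift the incoming activations had, the layer after this step always sees inputs on the same scale.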

It standardizes the activations during training. This normalization across layers helps the model perform well on similar data at test time.

Batch normalization handles the issue of varying input distributions by normalizing the inputs of each layer. This makes gradients more stable and convergence faster, analogous to input feature normalization.

It reduces the dependency of later layers on the exact distribution of earlier layers' outputs by normalizing those outputs. This makes the training of each layer more independent and more robust to changes, and reduces the impact of small changes in early layers.

Hope it helps! Feel free to ask if you need further assistance.

I mean, why does this action lead to this independence, and why exactly the later layers rather than the earlier ones? I get the concept but not the intuition or reasoning behind it. Why does normalizing do that?

But there are still beta and gamma, which the network is still learning in order to get the proper distribution. So, regarding questions 1 and 3: each layer is still training on its distribution, which means the network doesn't yet know what that distribution is, so how is the covariate shift problem solved? And the normalization isn't constant as in the input-layer case, so how does it accelerate learning?

Sorry to bother you, I just need to understand.

No problem at all! Now, let's go deeper into this:

As you may know, internal covariate shift refers to the changes in the distribution of layer inputs during training, which can slow down the training process. However, batch normalization solves this by normalizing the inputs of each layer to have a mean of zero and a variance of one.

When the inputs to a layer are normalized, the layer’s parameters (weights and biases) no longer need to adapt to varying input distributions. This means that each layer can learn its weights based on inputs that have a consistent distribution, regardless of the changes happening in previous layers. This decouples the learning process of each layer from the distributions of preceding layers.

You are correct that batch normalization involves learnable parameters, beta and gamma, which let the network scale and shift the normalized outputs. These parameters allow the network to learn the optimal distribution for each layer's inputs during training, giving it the flexibility to adjust the normalized outputs to better fit the data.
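To see what gamma and beta do concretely, here is a hedged NumPy sketch: the values of `gamma` and `beta` below are hypothetical stand-ins for what the network would learn.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # normalize to zero mean / unit variance, then apply the learnable
    # scale (gamma) and shift (beta) so the network can pick any distribution
    mu, var = x.mean(axis=0), x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 3))
gamma = np.array([2.0, 0.5, 1.0])   # hypothetical "learned" scales
beta = np.array([1.0, -1.0, 0.0])   # hypothetical "learned" shifts
y = batch_norm(x, gamma, beta)
# per-feature mean of y is ~beta, per-feature std is ~|gamma|
```

The key point: whatever gamma and beta end up being, the distribution the next layer sees is controlled by just these two learned parameters per feature, not by everything happening in all the earlier layers.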

Hope this clarifies your confusion! Feel free to ask if you need further assistance.

Why is this effect more significant in later layers than in earlier ones, as mentioned by Prof. Andrew?

So still my question:

The covariate shift problem is solved by keeping my distribution constant regardless of the layer inputs.

But my distribution isn't constant during training, because the neural network is still learning it, so how is the problem solved?

Because the variability in the distributions of inputs can accumulate and compound as data passes through multiple layers. In deeper networks, these accumulated shifts in distributions can become more pronounced. Batch normalization solves this issue.

Batch normalization adjusts the distributions of inputs during training, using the mean and variance calculated from each mini-batch. While the exact distribution is not fixed and is indeed being learned, the normalization step makes sure that the inputs to each layer keep a consistent scale and distribution throughout training.

Without batch normalization, the inputs to a given layer can have different distributions over the course of training, and the layer needs to continuously adapt to these shifts, leading to slower convergence. Also, if the variance of each layer's output depends on the variance of its input, this can lead to exploding or vanishing variances as we move deeper into the network.
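The compounding-variance point is easy to demonstrate numerically. The sketch below (plain NumPy, deliberately mis-scaled random weights, no activation function so the scale effect is isolated) pushes the same batch through a stack of linear layers with and without a normalization step:

```python
import numpy as np

rng = np.random.default_rng(0)

def bn(h, eps=1e-5):
    # per-feature normalization over the mini-batch
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

h_plain = h_bn = rng.standard_normal((64, 100))
for _ in range(10):
    W = rng.standard_normal((100, 100)) * 0.2  # deliberately mis-scaled init
    h_plain = h_plain @ W          # variance compounds layer after layer
    h_bn = bn(h_bn @ W)            # variance is reset to ~1 at every layer

print(h_plain.std(), h_bn.std())   # the first grows exponentially, the second stays ~1
```

Each un-normalized layer multiplies the activation scale by roughly the same factor, so after ten layers the scale has grown exponentially, while the normalized path stays at unit scale throughout. This is also why the effect is more visible in later (deeper) layers: they sit at the end of the compounding chain.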

In mathematical terms, the gradient of the loss L with respect to the input x_i of a batch-normalized layer is:

\frac{\partial L}{\partial x_i} = \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} \left( \frac{\partial L}{\partial \hat{x}_i} - \frac{1}{m} \sum_{j=1}^{m} \frac{\partial L}{\partial \hat{x}_j} - \frac{\hat{x}_i}{m} \sum_{j=1}^{m} \frac{\partial L}{\partial \hat{x}_j} \hat{x}_j \right)

As can be seen, the gradients are scaled by \frac{1}{\sqrt{\sigma_B^2 + \epsilon}}, which keeps them well-conditioned and avoids the exploding or vanishing gradients that would slow down training.
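One way to sanity-check this backward pass is to implement it and compare against a finite-difference estimate. A minimal NumPy sketch (gamma and beta omitted for simplicity, all names illustrative; the loss here is an arbitrary linear function of the normalized outputs):

```python
import numpy as np

def bn_forward(x, eps=1e-5):
    mu, var = x.mean(axis=0), x.var(axis=0)
    return (x - mu) / np.sqrt(var + eps), var

def bn_backward(dxhat, x_hat, var, eps=1e-5):
    # per-sample gradient of the loss w.r.t. x, vectorized over the batch (axis 0):
    # subtract the batch-mean of the upstream gradient, and the projection of
    # the upstream gradient onto x_hat, then rescale by 1/sqrt(var + eps)
    return (dxhat
            - dxhat.mean(axis=0)                    # (1/m) * sum_j dL/dxhat_j
            - x_hat * (dxhat * x_hat).mean(axis=0)  # (xhat_i/m) * sum_j dL/dxhat_j * xhat_j
            ) / np.sqrt(var + eps)

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 3))
g = rng.standard_normal((8, 3))       # stands in for dL/dxhat, i.e. L = sum(xhat * g)
x_hat, var = bn_forward(x)
analytic = bn_backward(g, x_hat, var)

# central-difference numerical gradient of L with respect to each x[i, j]
numeric = np.zeros_like(x)
h = 1e-5
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        xp, xm = x.copy(), x.copy()
        xp[i, j] += h
        xm[i, j] -= h
        numeric[i, j] = ((bn_forward(xp)[0] * g).sum()
                         - (bn_forward(xm)[0] * g).sum()) / (2 * h)

print(np.allclose(analytic, numeric, atol=1e-5))
```

The two gradients agree to numerical precision, which also makes the structure visible: every upstream gradient is rescaled by the same 1/sqrt(var + eps) factor, independent of the raw scale of the incoming activations.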

Hope it helps!

Thanks a lot

You’re welcome