What I understood about batch norm is that it forces the Z of one hidden layer to have a constant mean and variance, so the subsequent hidden layers can learn independently of the previous hidden layers even if the weights and biases of the previous hidden layer change as training goes on (which they inevitably will). Now the mean and variance of Z are governed by the beta and gamma associated with that hidden layer. My question is: beta and gamma are themselves trainable parameters of our learning algorithm, and therefore will change after some iterations (beta = beta - dbeta, gamma = gamma - dgamma). So the mean and variance will not remain constant across iterations, and if that happens, how are we going to get the desired result, i.e. forcing the distribution of Z to have a constant mean and variance (we want our hidden layers to learn independently of the previous layers)? Please correct me if I am wrong.

Let me offer my intuition:

Batch Normalization standardizes the batch inputs to have mean 0 and std 1 per feature (often loosely called "unit Gaussian", although the normalization itself does not make the values Gaussian), then it scales and offsets them by the learned gain (gamma) and bias (beta). It also keeps track of the means and standard deviations of the inputs by maintaining running averages of them. These are used later for inference, so that even a batch with a single example works, and they also become fairly stable over time.
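To make the two modes concrete, here is a minimal NumPy sketch of the forward pass. The function name `batch_norm_forward`, the `momentum` and `eps` values, and the toy data are my own assumptions for illustration; real frameworks differ in details such as the momentum convention and whether the running variance is bias-corrected.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, running_mean, running_var,
                       training=True, momentum=0.9, eps=1e-5):
    """Hypothetical helper: batch norm for a (batch, features) array."""
    if training:
        # Statistics are computed per feature, over the batch dimension.
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        # Exponential moving averages, kept only for use at inference time.
        running_mean = momentum * running_mean + (1 - momentum) * mean
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        # At inference the batch is not consulted; stored statistics are used.
        mean, var = running_mean, running_var
    x_hat = (x - mean) / np.sqrt(var + eps)   # normalize
    out = gamma * x_hat + beta                # scale and shift
    return out, running_mean, running_var

rng = np.random.default_rng(0)
features = 4
gamma, beta = np.ones(features), np.zeros(features)
running_mean, running_var = np.zeros(features), np.ones(features)

# "Training": feed many batches drawn with mean 3 and std 2.
for _ in range(200):
    batch = rng.normal(loc=3.0, scale=2.0, size=(32, features))
    _, running_mean, running_var = batch_norm_forward(
        batch, gamma, beta, running_mean, running_var)

# "Inference": a batch with a single example still works.
single = rng.normal(loc=3.0, scale=2.0, size=(1, features))
out_single, _, _ = batch_norm_forward(
    single, gamma, beta, running_mean, running_var, training=False)

print(running_mean)   # roughly 3 for every feature
print(running_var)    # roughly 4 for every feature
print(out_single.shape)
```

In training mode a batch of one example would have zero variance, which is why the running statistics are needed at inference.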

We use BatchNorm to control the statistics of activations in the NN. The purpose is to keep the hidden activations *from becoming too small or too large* (in general, we want them to be roughly Gaussian, at least at the start). This is the reason why it is so useful in training **Deep** NNs. BatchNorm layers are usually placed after Linear (Dense) or convolutional layers.

Answering your question:

If I understand you correctly here you are confusing Batch Normalization with Layer Normalization (which is different). If not, then I just misunderstood you.

In any case, Batch Normalization depends on other examples in the batch.

What happens is this: *first*, it forces each feature to have a mean of 0 and a standard deviation of 1 over **the batch**. For example, if a batch has shape (32, 100), then the mean over axis=0 would give 100 zeroes (approximately), and std(axis=0) would give 100 ones (approximately). Then, *second*, it multiplies the values by gamma and adds the bias (beta). After this step the means and standard deviations are no longer exactly 0 and 1; they are set by beta and gamma.
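The first step can be checked directly in NumPy. This is only a sketch with made-up random inputs; the `eps` value is an assumption (a small constant commonly added for numerical stability).

```python
import numpy as np

rng = np.random.default_rng(0)
# A batch of 32 examples with 100 features, deliberately far from mean 0 / std 1.
z = rng.normal(loc=5.0, scale=3.0, size=(32, 100))

eps = 1e-5  # assumed stability constant
z_hat = (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

print(z_hat.mean(axis=0).shape)          # one mean per feature: (100,)
print(np.abs(z_hat.mean(axis=0)).max())  # all means are approximately 0
print(z_hat.std(axis=0).min(), z_hat.std(axis=0).max())  # all stds approximately 1
```

Note that the statistics are taken down the batch axis, so each of the 100 features is normalized separately.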

What happened here is just that the activations of the previous layer were somewhat “tamed” so that the next layer could work with values that are not too small or too large.

Also, as a side note, a little bit of jittering of the BatchNorm statistics (the mean and variance) acts as a regularizer and helps train the NN to be more robust.

Also, note that gamma and beta in the mentioned example would be vectors of length 100. For example, gamma could be [2.33, 2.27, 2.35 … 2.20] and beta could be [0.14, 0.59, -0.11 … 0.55]. In that case, the means of the outputs of the Batch Normalization layer would not be 100 zeroes but [0.14, 0.59, -0.11 … 0.55] (that is, beta), and the standard deviations would be approximately [2.33, 2.27, 2.35 … 2.20] (that is, gamma).
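A quick numerical check of this, as a sketch: the gamma and beta values below are randomly generated stand-ins in a similar range to the example, not values from any trained network.

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(loc=-2.0, scale=7.0, size=(32, 100))
z_hat = (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + 1e-5)

# Hypothetical learned parameters, one value per feature.
gamma = rng.uniform(2.2, 2.4, size=100)
beta = rng.uniform(-0.2, 0.6, size=100)

out = gamma * z_hat + beta

# Per-feature output mean equals beta, per-feature output std equals gamma
# (up to the tiny effect of eps), whatever the statistics of z were.
print(np.abs(out.mean(axis=0) - beta).max())
print(np.abs(out.std(axis=0) - gamma).max())
```

So the output statistics are fully determined by gamma and beta, not by the mean and spread of the incoming Z.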

Thank you for your explanation. Actually, what I am trying to say is that since beta and gamma are trainable parameters, their values won't be the same in every training iteration. Therefore, in every iteration, after normalizing Z to have mean 0 and std 1, when we multiply it by gamma and add beta, it will have some mean (governed by beta in that particular iteration) and some std (governed by gamma in that particular iteration). Then can I say that in each iteration Z will have a different mean and deviation, i.e. the distribution won't have a constant mean and std?

If what I am saying is correct, then it won't have the desired effect in training a neural network: after every iteration the mean and variance of the distribution will also change, so the hidden layers won't learn independently. Am I correct in saying this?

Correct, though the differences between iterations will be small.

As I mentioned, it has the desirable side effect of regularization: it jitters the inputs a bit, like a small data augmentation, which helps the model to be robust. It will not shake the inputs so much that learning becomes impossible.
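To illustrate the difference in magnitude, here is a rough sketch comparing two kinds of change: a large drift in the previous layer's output versus one small update to gamma and beta. The drift (scale by 10, shift by 50), the learning rate, and the random "gradients" are all invented for illustration; they are not taken from a real training run.

```python
import numpy as np

def bn(z, gamma, beta, eps=1e-5):
    """Training-mode batch norm over axis 0 (hypothetical helper)."""
    z_hat = (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)
    return gamma * z_hat + beta

rng = np.random.default_rng(2)
z = rng.normal(size=(32, 100))
gamma = np.full(100, 2.0)
beta = np.full(100, 0.5)
out_before = bn(z, gamma, beta)

# Case 1: the earlier layers changed a lot, so Z is rescaled and shifted.
z_drifted = 10.0 * z + 50.0
out_drifted = bn(z_drifted, gamma, beta)

# Case 2: one small update of gamma and beta (random stand-in gradients).
lr = 0.01
gamma_new = gamma - lr * rng.normal(size=100)
beta_new = beta - lr * rng.normal(size=100)
out_stepped = bn(z, gamma_new, beta_new)

mean_shift_from_drift = np.abs(out_drifted.mean(axis=0) - out_before.mean(axis=0)).max()
mean_shift_from_step = np.abs(out_stepped.mean(axis=0) - out_before.mean(axis=0)).max()

print(f"mean shift after large upstream drift: {mean_shift_from_drift:.2e}")
print(f"mean shift after one gamma/beta step:  {mean_shift_from_step:.2e}")
```

The large upstream drift is absorbed by the normalization, while the output statistics move only as far as gamma and beta themselves move in one step.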

I’m not sure what you mean by that? (I’m not a mentor of that DLS so I’m not sure of the context). But in most cases (not considering fancy architectures) all layers depend on each other - one layer’s output is another’s input and the gradients flow accordingly.

What Batch Norm does, on the other hand, is tie together the samples: each sample in the batch affects the mean and the std (which is not desired in some cases, for example in RNNs when sequences have different lengths). So if your mini-batch happens to contain some “big” example, that affects the predictions for the other samples (which sounds ridiculous at first, as if it should not be that way, but in practice it’s not a problem).
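This coupling between samples is easy to see in a small experiment. The sketch below uses training-mode normalization only (no gamma or beta), and the outlier of +100 is an exaggerated, made-up value chosen to make the effect obvious.

```python
import numpy as np

def bn_train(z, eps=1e-5):
    """Training-mode normalization over the batch axis (hypothetical helper)."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(3)
batch = rng.normal(size=(32, 100))
out_normal = bn_train(batch)

# Replace the first example with a "big" one; rows 1..31 are left untouched.
batch_with_outlier = batch.copy()
batch_with_outlier[0] += 100.0
out_outlier = bn_train(batch_with_outlier)

# The untouched rows get different outputs, because the batch mean and std changed.
change_in_others = np.abs(out_outlier[1:] - out_normal[1:]).max()
print(change_in_others)
```

At inference time this does not happen, since the stored running statistics are used instead of the current batch's statistics.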

Again, many thanks for clearing my doubts.

By saying 'hidden layers will learn independently' I mean that whatever the output of the previous layers, by forcing it to take a constant mean and variance, in each iteration the weights of the subsequent layers won’t be that much dependent on the changes of the weights of the previous layers. That’s what Professor Ng said, and I think the intuition behind this is correct.

In the lecture "Why batch norm works", Professor Ng said this:

“And so, batch norm reduces the problem of the input values changing, it really causes these values to become more stable, so that the later layers of the neural network has more firm ground to stand on. And even though the input distribution changes a bit, it changes less, and what this does is, even as the earlier layers keep learning, the amounts that this forces the later layers to adapt to as early as layer changes is reduced or, if you will, it weakens the coupling between what the early layers parameters has to do and what the later layers parameters have to do. And so it allows each layer of the network to learn by itself, a little bit more independently of other layers, and this has the effect of speeding up of learning in the whole network. So I hope this gives some better intuition, but the takeaway is that batch norm means that, especially from the perspective of one of the later layers of the neural network, the earlier layers don’t get to shift around as much, because they’re constrained to have the same mean and variance. And so this makes the job of learning on the later layers easier.”

The most important words here are **“that much”**, which means that the activations of the previous layer will be “tamed” and will not take excessive values, so that the following layer can manage them, but the following layer is definitely still dependent on the previous one’s outputs. In other words, it offers some balance.

Oh okay, got it. I think now I have understood the whole concept. Thanks again for clearing my doubts, it’s been a huge help