As the NN is being trained, beta and gamma (corresponding to the shift of the mean and the scale of the variance) are calculated. It is also noted in the lectures that these parameters are learned by the network. But if we compute them anew for each batch, how are they being learned?
For instance, on batch 4 beta was calculated for layer 3, and then for batch 5 it is calculated again for layer 3 based on batch 5's samples. Where is the learning? The only way I see this happening is if we are talking about learning between epochs, in the case that we remember the beta and gamma parameters for each specific layer and each specific batch. But then we lose the ability to learn between batches within the same epoch, as we do for W and b, and it forces us to keep our batches the same (whereas I thought the data is reshuffled into different batches between epochs). Is this correct, or is something else going on here?
Hey @RTee,
If I am not wrong, you seem to be confusing gamma with the variance of a batch and beta with the mean of a batch. To answer this question, we don’t need to go any further than the TensorFlow documentation of the BatchNormalization layer, which you can find here.
I would like to highlight the formulation with which BatchNorm normalizes its inputs (borrowed from the docs):
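For reference, the TensorFlow docs give the per-feature computation during training as (up to notation):

```python
# output of the BatchNormalization layer for a training batch
gamma * (batch - mean(batch)) / sqrt(var(batch) + epsilon) + beta
```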
Here, you can see that mean(batch) and var(batch) are computed individually for each batch (no learning involved), while gamma and beta are like the weights of a typical neural network: parameters that are learnt across the different batches (learning involved).
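To make that distinction concrete, here is a minimal NumPy sketch (my own toy illustration, not the course's or TF's actual implementation; the upstream gradient is faked just to show the update rule). The batch statistics are recomputed from scratch for every batch, while gamma and beta persist across batches and get gradient updates, exactly like W and b:

```python
import numpy as np

gamma, beta = np.ones(4), np.zeros(4)   # one pair per feature, shared across ALL batches
eps, lr = 1e-3, 0.1

def batchnorm_forward(x, gamma, beta):
    mu = x.mean(axis=0)                 # recomputed for THIS batch only (no learning)
    var = x.var(axis=0)                 # recomputed for THIS batch only (no learning)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, x_hat

for batch in range(5):                  # pretend these are 5 mini-batches
    x = np.random.randn(32, 4)          # fake batch: 32 samples, 4 features
    out, x_hat = batchnorm_forward(x, gamma, beta)
    d_out = np.ones_like(out)           # pretend dL/d(out) = 1, just for illustration
    d_gamma = (d_out * x_hat).sum(axis=0)
    d_beta = d_out.sum(axis=0)
    gamma -= lr * d_gamma               # gamma and beta persist and are UPDATED,
    beta -= lr * d_beta                 # i.e. they are learned across batches
```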
In fact, sharing gamma (the learned scaling factor) and beta (the learned offset factor) across the different batches is what provides a uniform normalization scheme for all of them. Had they been different for each batch, the samples in each batch would be normalized to different values, and there would be little point in performing Batch Normalization at all, since the batches already have different means and variances before any normalization.
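You can also check this directly in Keras: once the layer is built, gamma and beta show up as trainable variables (updated by the optimizer), while the moving mean and variance it tracks for inference are non-trainable (exact variable names may differ slightly across TF versions):

```python
import tensorflow as tf

bn = tf.keras.layers.BatchNormalization()
bn(tf.zeros((2, 4)))  # build the layer on a dummy batch with 4 features

# gamma and beta: learned by gradient descent, shared across all batches
print([v.name for v in bn.trainable_variables])
# moving_mean and moving_variance: running statistics, not learned by gradients
print([v.name for v in bn.non_trainable_variables])
```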