Hello @Rod_Bennett,

Thanks a lot for asking this question on Discourse. I am a mentor and I will do my best to help. First I will explain the terms and why gamma and beta exist; then I will answer your specific question.

Batch normalization normalizes the layer inputs to have a mean of 0 and standard deviation of 1. This helps the network learn faster and more stably.

However, forcing every layer's inputs into this fixed distribution can be too restrictive. The gamma and beta parameters allow the network to restore the original input distribution if needed.

Gamma acts as a scaling factor and beta acts as an offset factor. They are learned parameters.

If the original input distribution is useful, the network can learn gamma and beta values that result in the final normalized outputs matching the original unnormalized values.

So gamma and beta give the network flexibility to undo the normalization if it helps optimize the loss function.
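To make the mechanics concrete, here is a minimal NumPy sketch of the batch normalization forward pass (my own illustration, not code from the course; `batch_norm` is just an illustrative name):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale by gamma and shift by beta."""
    mu = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized: mean 0, std ~1
    return gamma * x_hat + beta            # learned scale and shift

# With gamma = 1 and beta = 0, the output is simply the normalized input.
x = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
out = batch_norm(x, gamma=np.ones(2), beta=np.zeros(2))
```

During training, gamma and beta would be updated by gradient descent just like the weights; here they are fixed only to show the computation.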

To answer your specific question:

After normalizing the activations, the next step introduces the beta and gamma parameters: the batch-normalized values are multiplied by gamma (the scaling factor) and then beta (the offset factor) is added. Both are learned parameters, which means they are updated along with the network's weights and biases during training.

If the original input distribution is useful, the network can learn gamma and beta values that make the final outputs match the original unnormalized values. In other words, gamma and beta give the network the flexibility to undo the normalization when that helps minimize the loss.

It's important to note that beta and gamma do not substitute for the weights (W) and biases (b) of the neural network. They are additional learnable parameters that work alongside the weights and biases to give the network more flexibility in its learning process.
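As a concrete illustration of "undoing" the normalization (a hypothetical, hand-picked choice of gamma and beta, not something the network is forced to learn): setting gamma to the batch standard deviation and beta to the batch mean recovers the original inputs exactly.

```python
import numpy as np

eps = 1e-5
x = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
mu, var = x.mean(axis=0), x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations

gamma = np.sqrt(var + eps)  # scale back by the original std
beta = mu                   # shift back by the original mean
restored = gamma * x_hat + beta  # equals x again
```

This is exactly the identity the lectures mention: if the original distribution is best for minimizing the loss, gradient descent can drive gamma and beta toward these values.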

If you have a follow-up question, please feel free to reply. I hope my response clarified things for you.

Regards,

Can Koz