Hello @Rod_Bennett,

Thanks a lot for asking this question on Discourse. I am a mentor and I will do my best to help. First I will explain the terms and why gamma and beta exist; then I will answer your specific question.

Batch normalization normalizes the layer inputs to have a mean of 0 and standard deviation of 1. This helps the network learn faster and more stably.

However, forcing every layer's inputs into this fixed distribution can be too restrictive. The gamma and beta parameters allow the network to restore the original input distribution if needed.

Gamma acts as a scaling factor and beta acts as an offset factor. They are learned parameters.

If the original input distribution is useful, the network can learn gamma and beta values that result in the final normalized outputs matching the original unnormalized values.

So gamma and beta give the network flexibility to undo the normalization if it helps optimize the loss function.
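To make the mechanics concrete, here is a minimal NumPy sketch of the batch normalization forward pass (my own illustration, not code from the course; `batch_norm` is just an illustrative name):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale by gamma and shift by beta."""
    mu = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized: mean 0, std ~1
    return gamma * x_hat + beta            # learned scale and shift

# With gamma = 1 and beta = 0, the output is simply the normalized input.
x = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
out = batch_norm(x, gamma=np.ones(2), beta=np.zeros(2))
```

During training, gamma and beta would be updated by gradient descent just like the weights; here they are fixed only to show the computation.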

To answer your specific question:

After normalizing the activations, the next step introduces the beta and gamma parameters: the batch-normalized values are multiplied by gamma (the scaling factor) and then beta (the offset factor) is added. Both are learned parameters, which means they are updated along with the network's weights and biases during training.

If the original input distribution is useful, the network can learn gamma and beta values that make the final outputs match the original unnormalized values. In other words, gamma and beta give the network the flexibility to undo the normalization when that helps minimize the loss.

It's important to note that beta and gamma do not substitute for the weights (W) and biases (b) of the neural network. They are additional learnable parameters that work alongside the weights and biases to give the network more flexibility in its learning process.
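As a concrete illustration of "undoing" the normalization (a hypothetical, hand-picked choice of gamma and beta, not something the network is forced to learn): setting gamma to the batch standard deviation and beta to the batch mean recovers the original inputs exactly.

```python
import numpy as np

eps = 1e-5
x = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
mu, var = x.mean(axis=0), x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations

gamma = np.sqrt(var + eps)  # scale back by the original std
beta = mu                   # shift back by the original mean
restored = gamma * x_hat + beta  # equals x again
```

This is exactly the identity the lectures mention: if the original distribution is best for minimizing the loss, gradient descent can drive gamma and beta toward these values.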

If you have a follow-up question, please feel free to reply. I hope my response clarified things for you.

Regards,

Can Koz