Why does batch norm work?

I have seen other topics here about how batch norm works, but I do not really see the answer to my question. So I have the following question:

In the video, Andrew says that if we are in layer 2, the values Z[2] will change during the updates, but if we use batch norm their mean and variance will remain the same, namely beta[2] and gamma[2]. I do not understand this, since gamma and beta are also updated during training, which is the whole point of batch norm. So this mean and variance also change. Or maybe the point is that they do change, but they change smoothly?

Yes. I think the key takeaway is this part of the lecture:

So from the perspective of the third hidden layer, these hidden unit values are changing all the time, and so it’s suffering from the problem of covariate shift that we talked about on the previous slide. So what batch norm does, is it reduces the amount that the distribution of these hidden unit values shifts around.
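To make the mechanism concrete, here is a minimal NumPy sketch (not from the course materials; the gamma/beta values are arbitrary illustrations). It normalizes the pre-activations Z[2] of a hypothetical layer over a mini-batch and then rescales them, showing that whatever the earlier layers do to Z[2], the per-unit mean and standard deviation after batch norm are pinned to beta and gamma:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pre-activations of a hypothetical layer 2, shape (units, batch)
# as in the course notation. The earlier layers could shift this
# distribution arbitrarily; batch norm undoes that shift.
Z2 = rng.normal(loc=5.0, scale=3.0, size=(4, 64))

# Learnable per-unit parameters (fixed here for illustration).
gamma = np.array([[1.5], [0.5], [2.0], [1.0]])
beta = np.array([[0.0], [1.0], [-1.0], [3.0]])

eps = 1e-8
mu = Z2.mean(axis=1, keepdims=True)
var = Z2.var(axis=1, keepdims=True)
Z2_norm = (Z2 - mu) / np.sqrt(var + eps)   # zero mean, unit variance per unit
Z2_tilde = gamma * Z2_norm + beta          # mean becomes beta, std becomes gamma

print(np.allclose(Z2_tilde.mean(axis=1, keepdims=True), beta))  # True
print(np.allclose(Z2_tilde.std(axis=1, keepdims=True), gamma))  # True
```

So the later layers always see inputs whose mean and variance are controlled by gamma and beta alone. Those parameters do get updated by gradient descent, but only by small gradient steps, so the distribution the next layer sees drifts slowly and deliberately rather than being whipped around by every weight update in the earlier layers.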