(The post has been removed by Admin)
Hi, @NocturneJay.
Would you mind editing your post so that it doesn’t give away the answer? I’ll try to answer without spoiling it, too.
I think what you’re missing is that a pair of parameters, Gamma and Beta, is learnt for each output. They are indeed represented as vectors in some of the lectures, which is probably what confused you.
You may want to take a look at page 3 of the Batch Normalization paper if it’s still not clear.
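In case it helps, here’s a minimal sketch of the train-time transform described in Algorithm 1 of the paper (the function name batch_norm and the eps default are my own choices; the steps follow the paper):

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x has shape (n_units, m): one column per example in the mini-batch
    mu = x.mean(axis=1, keepdims=True)     # per-unit mini-batch mean, shape (n_units, 1)
    var = x.var(axis=1, keepdims=True)     # per-unit mini-batch variance, shape (n_units, 1)
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize each unit to roughly N(0, 1)
    return gamma * x_hat + beta            # scale and shift: one (Gamma, Beta) pair per unit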
Let me know if that helped.
Sorry for leaking the answer, but it seems that I cannot edit my post anymore.
Thanks for your reply, @nramon.
What really confuses me is this: for a layer with, say, 3 hidden units and a batch size of m, what are the shapes of Gamma and Beta, respectively?
From the paper you referred to (page 3, Algorithm 1), I think that after normalization \hat{x} \sim N(0, 1) (where N(0, 1) is the standard normal distribution). So, to make every example in the batch have the same distribution, each component (or scalar) of Gamma, and likewise of Beta, should be the same. Am I right?
No worries, @NocturneJay.
In that case, the hidden layer’s output would have shape (3, m), right? Gamma and Beta would both have shape (3, 1), regardless of m, and would be “stretched” to compute the element-wise product and the addition through broadcasting.
I think you’ll understand it better if you play with the following code:
>>> import numpy as np
>>> gamma = np.random.rand(3, 1)  # one Gamma entry per hidden unit
>>> beta = np.random.rand(3, 1)   # one Beta entry per hidden unit
>>> m = 1                         # batch size of 1
>>> z_norm = np.random.randn(3, m)
>>> z_tilde = gamma * z_norm + beta  # broadcasting stretches (3, 1) across the batch axis
>>> m = 32                        # larger batch, same Gamma and Beta
>>> z_norm = np.random.randn(3, m)
>>> z_tilde = gamma * z_norm + beta  # still works: (3, 1) broadcasts against (3, m)
Yes, after normalization the outputs are distributed as N(0, 1). Gamma and Beta are precisely what allow them to have different distributions, if that’s the optimal thing to do!
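To see this concretely, you can extend the snippet above and check the per-unit statistics of z_tilde (the large m is only there so the sample statistics settle down):

>>> m = 10000
>>> z_norm = np.random.randn(3, m)
>>> z_tilde = gamma * z_norm + beta
>>> np.round(z_tilde.mean(axis=1), 2)  # close to beta's three entries
>>> np.round(z_tilde.std(axis=1), 2)   # close to gamma's three entries

Each hidden unit ends up with its own mean (Beta) and spread (Gamma), which is exactly the point.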
Thank you so much, @nramon. The code really helps.
I think I was stuck on the idea that, after batch norm, the \tilde{z} values should all have the same distribution. But actually, different Gamma and Beta values can give each output its own distribution.
Exactly!
Glad I could help.