There is one global value of γ∈R and one global value of β∈R for each layer, and applies to all the hidden units in that layer.
How is this statement false? We set a single value of γ and β for a particular layer, if I am not wrong.
Hi, @SwadhinNagulpelli.
If you don’t mind, please delete the last sentence, since it gives away the answer.
The expression from the original paper may be clearer:
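Presumably this refers to the per-activation transform in Ioffe & Szegedy (2015), where each hidden unit k gets its own learned scale and shift:

y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}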
Let me know if that helped
I’m wondering the same thing. Is it because of the word “global”? It says “there’s a global X for each layer”, though, so I don’t really know what to make of it.
I think the phrase “one global value” is the problem, because \beta and \gamma should be different between the hidden units in one layer. Like the parameters W and b, \beta and \gamma are vectors within a layer rather than “one global value”.
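A minimal sketch of this (assuming TensorFlow/Keras is available; the layer sizes are made up) showing that \gamma and \beta come out as one value per hidden unit, not one per layer:

```python
# Sketch: inspect the shapes of gamma and beta in a BatchNormalization layer.
# Assumes TensorFlow is installed; the sizes are arbitrary.
import tensorflow as tf

dense = tf.keras.layers.Dense(5)            # a layer with 5 hidden units
bn = tf.keras.layers.BatchNormalization()

x = tf.random.normal((8, 3))                # batch of 8 examples, 3 input features
z = dense(x)                                # pre-activations, shape (8, 5)
z_tilde = bn(z, training=True)              # builds gamma and beta

print(bn.gamma.shape)   # (5,) -> one gamma per hidden unit, like a row of W
print(bn.beta.shape)    # (5,) -> one beta per hidden unit, like b
```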
Interesting. I understood it by viewing the units in a hidden layer as features, just like the features in an example (the x variable you feed to the neural network).
Now, every feature has its own mean and variance. The formulas from the lectures are:
\mu = \frac{1}{m} \sum_i z^{(i)}
\sigma^2 = \frac{1}{m} \sum_i (z^{(i)} - \mu)^2
z_{norm}^{(i)} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}}
\tilde z^{(i)} = \gamma z_{norm}^{(i)} + \beta
\mu and \sigma^2 are vectors whose length equals the number of features. When you learn the values of \gamma and \beta, they should be vectors too, since every element of those vectors corresponds to one of the features (hidden units). I think that is the reason there should be an element-wise product and summation in the formula you gave.
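Here is a rough numpy sketch of those formulas (the shapes are made up) showing that \mu, \sigma^2, \gamma, and \beta each have one entry per hidden unit:

```python
# Per-feature batch normalization of the pre-activations z, shape (m, n_units).
import numpy as np

m, n_units = 4, 3
rng = np.random.default_rng(0)
z = rng.normal(size=(m, n_units))           # z^{(i)} stacked over the mini-batch

mu = z.mean(axis=0)                         # shape (n_units,), one mean per unit
sigma2 = z.var(axis=0)                      # shape (n_units,), one variance per unit
eps = 1e-8

z_norm = (z - mu) / np.sqrt(sigma2 + eps)   # broadcasts over the batch dimension

gamma = np.ones(n_units)                    # learned, one value per hidden unit
beta = np.zeros(n_units)                    # learned, one value per hidden unit
z_tilde = gamma * z_norm + beta             # element-wise product and sum
```

Each column (hidden unit) gets its own \mu, \sigma^2, \gamma, and \beta, which is why “one global value” per layer is not right.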
Henrikh