Here is a recap of Andrew's talk.
There are two steps in Batch Normalization:
- Normalize the input (the output of the previous layer) to have mean = 0 and variance = 1.
- Shift and scale the normalized data with \gamma and \beta (both \gamma and \beta are trainable).
Data normalized to mean = 0 and variance = 1 is sometimes not the most appropriate input for the next layer. For such cases, \gamma and \beta give us the option to adjust it slightly.
Now, let's look at the equations of Batch Normalization.
First, we calculate the mean and variance of the output of the previous layer over a mini-batch of m examples:
\mu = \frac{1}{m}\sum_{i=1}^{m}z^{(i)} \\
\sigma^2 = \frac{1}{m}\sum_{i=1}^m(z^{(i)} - \mu)^2 \\
Then, we normalize the whole mini-batch using the above \mu and \sigma^2, and apply the trainable scale and shift:
z_{norm}^{(i)} = \frac{z^{(i)}- \mu}{\sqrt{\sigma^2+\epsilon}} \\
\tilde{z}^{(i)} = \gamma z_{norm}^{(i)} + \beta
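Below is a minimal NumPy sketch of these equations, assuming the input z is a mini-batch of shape (m, n_units); the function name batch_norm_forward, the sample batch, and the initial values of gamma and beta are my own illustrative choices, not from the talk.

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    """Normalize a mini-batch z of shape (m, n_units), then scale and shift."""
    mu = z.mean(axis=0)                      # \mu: per-unit mean over the mini-batch
    var = z.var(axis=0)                      # \sigma^2: per-unit variance over the mini-batch
    z_norm = (z - mu) / np.sqrt(var + eps)   # z_norm: mean 0, variance 1
    z_tilde = gamma * z_norm + beta          # \tilde{z}: scaled and shifted
    return z_tilde, z_norm, mu, var

# Example usage with an assumed mini-batch of m=4 examples and 3 units.
z = np.random.randn(4, 3) * 5.0 + 2.0
gamma = np.ones(3)   # initial scale
beta = np.zeros(3)   # initial shift
z_tilde, z_norm, mu, var = batch_norm_forward(z, gamma, beta)
print(z_norm.mean(axis=0))  # close to 0
print(z_norm.var(axis=0))   # close to 1
```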
If \gamma and \beta take the following values:
\gamma = \sqrt{\sigma^2 + \epsilon} \ , \ \ \beta = \mu
Then,
\tilde{z}^{(i)} = \gamma z_{norm}^{(i)} + \beta = \sqrt{\sigma^2 + \epsilon} \cdot \frac{z^{(i)}- \mu}{\sqrt{\sigma^2+\epsilon}} + \mu = z^{(i)}
As you can see, there is no change in the data (the output of the previous layer).
\gamma and \beta are trainable variables, so the values above may be reached through training or set intentionally. In either case, it means that no transformation is needed.
In general, any transformation, including normalization, may lose some important characteristics of the data. In that sense, these can be said to be the optimal values, I think.
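As a quick numeric check of this identity case, the sketch below reuses the hypothetical batch_norm_forward function and sample batch z from the snippet above, sets \gamma = \sqrt{\sigma^2 + \epsilon} and \beta = \mu, and confirms that the output matches the input.

```python
# Compute the batch statistics with the neutral gamma=1, beta=0.
_, _, mu, var = batch_norm_forward(z, np.ones(3), np.zeros(3))

# Choose gamma and beta exactly as in the identity case above.
gamma_id = np.sqrt(var + 1e-5)   # \gamma = \sqrt{\sigma^2 + \epsilon}
beta_id = mu                     # \beta = \mu

z_tilde, _, _, _ = batch_norm_forward(z, gamma_id, beta_id)
print(np.allclose(z_tilde, z))   # True: the data passes through unchanged
```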