I have a few questions on input data and batch normalization:
- With respect to input data normalization: when doing mini-batch optimization, is the entire input dataset normalized once (subtract the mean, divide by the standard deviation), or can each mini-batch be normalized separately?
- With respect to batch normalization (making \beta^{[l]}, \gamma^{[l]} parameters that are optimized together with W^{[l]}): Prof. Andrew talks about normalizing the hidden layers (normalizing z^{[1]}, z^{[2]}, \ldots). Can the same be done for z^{[0]}, which is the same as the mini-batch input X, optimizing \beta^{[0]}, \gamma^{[0]} as well?
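To make the second question concrete, here is a minimal NumPy sketch of the operation being asked about: per-mini-batch normalization with learnable scale and shift, applied to the input X as if it were z^{[0]}. The function name `batch_norm`, the shapes, and the random data are my own illustration, not from the lecture.

```python
import numpy as np

def batch_norm(Z, gamma, beta, eps=1e-8):
    """Normalize each feature over the mini-batch, then rescale.

    Z: (n_features, m) mini-batch; gamma, beta: (n_features, 1).
    """
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)  # zero mean, unit variance per batch
    return gamma * Z_norm + beta            # learnable scale and shift

# Treating the input X as z^{[0]}: the same operation applies, with
# beta^{[0]}, gamma^{[0]} learned like any other layer's parameters.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(4, 64))  # mini-batch of 64 examples
gamma0 = np.ones((4, 1))
beta0 = np.zeros((4, 1))
X_tilde = batch_norm(X, gamma0, beta0)
```

With gamma = 1 and beta = 0 this reduces to plain per-mini-batch standardization of the input, so the two questions are really about the same computation; the difference is only whether the statistics come from the full dataset (computed once) or from each mini-batch, and whether \beta, \gamma are then learned.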