Questions on batch normalization

I have a few questions on input data and batch normalization:

  1. With respect to input data normalization:
    Is the entire input data set normalized (subtract the mean, divide by the standard deviation) once when doing mini-batch optimization, or can each mini-batch input be normalized separately?

  2. With respect to mini-batch normalization (and making \beta^{[l]}, \gamma^{[l]} parameters to be optimized together with W^{[l]}), Prof Andrew talks about normalization for the hidden layers (normalizing z^{[1]}, z^{[2]}, \ldots).
    Can the same be done for z^{[0]}, which is the same as the mini-batch input X, and can we optimize for \beta^{[0]}, \gamma^{[0]} as well?

  1. Each mini-batch of data entering the batchnorm layer is standardized as it arrives (see the sketch after this list). Remember that \beta and \gamma are learnt along the way during training.
  2. You can technically use a batch norm layer to standardize your data by placing it before any processing layers. I have not seen anyone do it in place of explicit preprocessing, though.
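To make item 1 concrete, here is a minimal NumPy sketch of the training-time forward pass (the function name, shapes, and epsilon value are my own; it follows the course's features-by-examples layout). Each mini-batch is standardized with its own mean and variance, then rescaled and shifted by the learnable \gamma and \beta:

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Standardize one mini-batch of pre-activations Z (features x examples),
    then scale and shift with the learnable parameters gamma and beta."""
    mu = Z.mean(axis=1, keepdims=True)       # per-feature mean of THIS mini-batch
    var = Z.var(axis=1, keepdims=True)       # per-feature variance of THIS mini-batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    Z_tilde = gamma * Z_norm + beta          # learned scale and shift
    return Z_tilde, mu, var

# Illustrative mini-batch: 64 examples, 5 features
rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 64))
gamma = np.ones((5, 1))    # initialized to 1 (identity scale)
beta = np.zeros((5, 1))    # initialized to 0 (no shift)
Z_tilde, mu, var = batchnorm_forward(Z, gamma, beta)
```

With \gamma initialized to 1 and \beta to 0 the layer starts out as a pure standardization of each mini-batch; gradient descent then moves \gamma and \beta to whatever scale and shift work best.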

In Question 1, I am trying to understand whether the input data X is normalized once as a preprocessing step, or whether each mini-batch input X^{i} is normalized separately.

Each mini-batch of data is normalized on the fly (separately) during training, while the parameters are learned simultaneously. At test time the parameters are fixed, and the normalization uses running averages of the per-batch means and variances accumulated during training.
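As a sketch of that training/test split (assuming exponentially weighted running averages; the momentum and epsilon values are illustrative, not prescribed by the course):

```python
import numpy as np

def batchnorm_train_step(Z, gamma, beta, running_mu, running_var,
                         momentum=0.9, eps=1e-8):
    """Training-time pass: normalize with THIS mini-batch's statistics
    and update the running averages used later at test time."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    Z_tilde = gamma * (Z - mu) / np.sqrt(var + eps) + beta
    return Z_tilde, running_mu, running_var

def batchnorm_test(Z, gamma, beta, running_mu, running_var, eps=1e-8):
    """Test-time pass: no batch statistics are computed; the fixed running
    averages (and the learned gamma, beta) are used instead."""
    return gamma * (Z - running_mu) / np.sqrt(running_var + eps) + beta
```

Deep learning frameworks do this bookkeeping for you (typically via a training/inference flag on the batch norm layer), but the idea is the same: batch statistics during training, fixed running statistics at test time.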

If this is confusing, please watch the lectures. Andrew does a really nice job of providing an overview of the batch normalization process.