Aannnd we have a diagram showing “normalization” (both input normalization and batch normalization) in a 2-layer ANN, at least for the forward propagation phase.
The backward propagation seems a bit more difficult.
For input normalization, would one really compute the mean and variance over the whole example set?
It would probably be sufficient to select a “large enough” sample and compute the sample mean and sample variance.
Or maybe even normalize the input on a per-batch basis.
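To get a feel for how much these three options differ, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the data is a synthetic 1-D feature drawn from a Gaussian, the helper names `mean_and_std` and `normalize` are made up, and the sample size (2,000) and batch size (32) are arbitrary. It also uses the population standard deviation throughout, which differs only negligibly from the sample version at these sizes.

```python
import random
import statistics


def mean_and_std(values):
    """Return (mean, population standard deviation) of a list of floats."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values, mu)
    return mu, sigma


def normalize(values, mu, sigma, eps=1e-8):
    """Shift and scale values with the given statistics (eps avoids division by zero)."""
    return [(v - mu) / (sigma + eps) for v in values]


rng = random.Random(0)

# Hypothetical 1-D input feature: 100,000 examples drawn from N(5, 2^2).
data = [rng.gauss(5.0, 2.0) for _ in range(100_000)]

# Option 1: statistics over the whole example set.
full_mu, full_sigma = mean_and_std(data)

# Option 2: statistics estimated from a "large enough" random sample.
sample = rng.sample(data, 2_000)
sample_mu, sample_sigma = mean_and_std(sample)

# Option 3: statistics recomputed from each batch (here: just the first batch).
batch = data[:32]
batch_mu, batch_sigma = mean_and_std(batch)
batch_normalized = normalize(batch, batch_mu, batch_sigma)

print(f"full:   mean={full_mu:.3f}  std={full_sigma:.3f}")
print(f"sample: mean={sample_mu:.3f}  std={sample_sigma:.3f}")
print(f"batch:  mean={batch_mu:.3f}  std={batch_sigma:.3f}")
```

On data like this, the sample estimates should land close to the full-set values, while the per-batch estimates wobble more from batch to batch because each one is based on only 32 examples. Whether that extra noise matters in practice is a separate question that this sketch does not answer.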