I do not understand the first reason given for the effectiveness of batch norm (1st slide of “Normalizing Activations in a Network”), namely that rounder contours make the problem easier for the optimizer. Isn’t input normalization already doing that?

We have learned that input normalization makes the contours of the optimization problem rounder. Aren’t all W^{[l]}, b^{[l]} independent variables of the optimization problem, and thus already optimally scaled if the contours are round?

If not, then what are the independent variables in the contour plots shown in the lectures on input normalization? Only W^{[1]}, b^{[1]}? Or only A^{[1]}?

If yes, then what else is there for batch norm to improve in the optimization problem? Perhaps input normalization scales the independent variables W^{[l]}, b^{[l]} so that the contours are somewhat rounder than without any normalization, and batch norm then improves the roundness further?

Batch norm is applied to each mini-batch, instead of normalizing the entire training set in bulk.
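The difference can be illustrated with a small numpy sketch (a toy example of my own, not from the course; note that in a real network batch norm is applied to the pre-activations z^{[l]} of a layer, not to the raw inputs, and is followed by the learnable gamma/beta rescaling):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))  # toy data: 1000 examples, 4 features

# Input normalization: one mean/std per feature, computed once over the whole set
mu_full = X.mean(axis=0)
sigma_full = X.std(axis=0)
X_norm = (X - mu_full) / sigma_full

# Batch-norm-style normalization: mean/variance recomputed for each mini-batch
batch = X[:64]                      # one mini-batch of 64 examples
eps = 1e-8                          # avoids division by zero
mu_batch = batch.mean(axis=0)
var_batch = batch.var(axis=0)
batch_normed = (batch - mu_batch) / np.sqrt(var_batch + eps)

# The per-batch statistics only approximate the full-dataset statistics
print(mu_full[0], mu_batch[0])
```

Both produce zero-mean, unit-variance outputs, but the batch statistics drift from batch to batch, which is exactly the distinction discussed below.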

From “Machine Learning Mastery”:

Alternate to Data Preparation
Batch normalization could be used to standardize raw input variables that have differing scales.

If the mean and standard deviation for each input feature are calculated over the mini-batch instead of over the entire training dataset, then the batch size must be sufficiently representative of the range of each variable.

It may not be appropriate for variables that have a data distribution that is highly non-Gaussian, in which case it might be better to perform data scaling as a pre-processing step.
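The “sufficiently representative” point above can be checked numerically. Here is a hypothetical sketch (names and numbers are my own) showing how much per-batch means wander around the true dataset mean for small versus large batch sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)  # one feature, full "dataset"

def batch_mean_spread(batch_size, n_batches=500):
    """Standard deviation of per-batch means: how far the batch
    statistics wander from the true dataset mean (which is 0 here)."""
    means = [rng.choice(x, size=batch_size, replace=False).mean()
             for _ in range(n_batches)]
    return float(np.std(means))

spread_small = batch_mean_spread(8)    # tiny mini-batches
spread_large = batch_mean_spread(512)  # large mini-batches
# Per-batch means fluctuate roughly like 1/sqrt(batch_size),
# so small batches give much noisier normalization statistics.
```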

Thanks for the fast reply, TMosh. Your answer helps me in understanding input norm vs batch norm for the input variables.

However, I still do not understand how the first reason (rounder contours) is achieved by batch norm at any layer when input normalization has already been done.

If you already normalized the entire data set, there may be only marginal gains from using batch norm. The mean and standard deviation for each batch are computed from that specific batch of examples, and may differ slightly from those of the entire data set.

The “rounder contours” is another way of saying that the magnitudes of the features are all going to be very similar - this allows a higher learning rate to be used without risk of the solution diverging during training. So you can use a higher learning rate and fewer iterations, for example.
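The learning-rate point can be made concrete with a toy quadratic (my own sketch, not from the lecture): plain gradient descent on J(w) = 0.5 · wᵀHw converges only if the learning rate is below 2/λ_max(H), so elongated contours (one large eigenvalue) force a small learning rate, while round contours tolerate a larger one:

```python
import numpy as np

def gd_final_norm(H, lr, steps=50):
    """Run plain gradient descent on J(w) = 0.5 * w^T H w and
    return the final distance from the optimum w* = 0."""
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        w = w - lr * (H @ w)       # gradient of J is H @ w
    return float(np.linalg.norm(w))

round_H = np.diag([1.0, 1.0])       # round contours: equal eigenvalues
elongated_H = np.diag([1.0, 50.0])  # elongated contours: eigenvalues differ 50x

lr = 0.1  # safe for round_H (0.1 < 2/1) but unstable for elongated_H (0.1 > 2/50)
# gd_final_norm(round_H, lr)     -> close to 0 (converges)
# gd_final_norm(elongated_H, lr) -> astronomically large (diverges)
```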

@TMosh, do you mean “If you already normalized the entire data set, there may be only marginal gains from using batch norm for the first layer”? Or rather “If you already normalized the entire data set, there may be only marginal gains w.r.t. the first reason (rounder contours) from using batch norm at all layers”?

Since we want to optimize J(W^{[1]}, b^{[1]}, ..., W^{[L]}, b^{[L]}), “rounder contours” is another way of saying that the magnitudes of the gradients \nabla_{W^{[l]}, b^{[l]}} J (not of the features) are all going to be very similar, right? So does the first reason given for the effectiveness of batch norm at all layers (1st slide of “Normalizing Activations in a Network”) not hold if the inputs are normalized!?
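For what it’s worth, here is one way to connect “similar feature magnitudes” with “similar gradient magnitudes” in the simplest case, a linear model (my own sketch; X, y, m are the usual design matrix, targets, and number of examples):

```latex
% For linear regression, J(w) = \frac{1}{2m}\lVert Xw - y\rVert^2, so
\nabla_w J = \frac{1}{m} X^\top (Xw - y), \qquad
H = \nabla_w^2 J = \frac{1}{m} X^\top X .
% The contours of J are ellipses whose axes scale with the eigenvalues of H.
% If feature j has a much larger scale than feature k, then H_{jj} \gg H_{kk}:
% the contours are elongated AND \partial J / \partial w_j is typically much
% larger than \partial J / \partial w_k. Normalizing the features equalizes
% the diagonal of H, so "similar feature magnitudes", "similar gradient
% magnitudes", and "rounder contours" are three views of the same fact.
```

This only covers the input layer of a linear model; the open question above is whether the same argument carries over to the hidden-layer parameters of a deep network.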