I do not understand the first reason given for the effectiveness of batch norm (1st slide of “Normalizing Activations in a Network”), namely that rounder contours make the problem easier for the optimizer. Isn’t input normalization already doing that?

We have learned that input normalization makes the contours of the optimization problem rounder. Aren’t all W^{[l]}, b^{[l]} independent variables of the optimization problem, and thus already optimally scaled if the contours are round?

If not, then what are the independent variables in the contour plots shown in the lectures on input normalization? Only W^{[1]}, b^{[1]}? Or only A^{[1]}?

If yes, then what else is there for batch norm to improve in the optimization problem? Perhaps input normalization scales the independent variables W^{[l]}, b^{[l]} so that the contours are somewhat rounder than without any normalization, and batch norm then improves the roundness further?

Batch norm is applied to each mini-batch, instead of normalizing the entire training set in bulk.
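The difference can be illustrated with a small numpy sketch (a toy example of my own, not from the course; note that in a real network batch norm is applied to the pre-activations z^{[l]} of a layer, not to the raw inputs, and is followed by the learnable gamma/beta rescaling):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))  # toy data: 1000 examples, 4 features

# Input normalization: one mean/std per feature, computed once over the whole set
mu_full = X.mean(axis=0)
sigma_full = X.std(axis=0)
X_norm = (X - mu_full) / sigma_full

# Batch-norm-style normalization: mean/variance recomputed for each mini-batch
batch = X[:64]                      # one mini-batch of 64 examples
eps = 1e-8                          # avoids division by zero
mu_batch = batch.mean(axis=0)
var_batch = batch.var(axis=0)
batch_normed = (batch - mu_batch) / np.sqrt(var_batch + eps)

# The per-batch statistics only approximate the full-dataset statistics
print(mu_full[0], mu_batch[0])
```

Both produce zero-mean, unit-variance outputs, but the batch statistics drift from batch to batch, which is exactly the distinction discussed below.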

From “Machine Learning Mastery”:

Alternate to Data Preparation
Batch normalization could be used to standardize raw input variables that have differing scales.

If the mean and standard deviation for each input feature are calculated over the mini-batch instead of over the entire training dataset, then the batch size must be sufficiently representative of the range of each variable.

It may not be appropriate for variables that have a data distribution that is highly non-Gaussian, in which case it might be better to perform data scaling as a pre-processing step.
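The “sufficiently representative” point above can be checked numerically. Here is a hypothetical sketch (names and numbers are my own) showing how much per-batch means wander around the true dataset mean for small versus large batch sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)  # one feature, full "dataset"

def batch_mean_spread(batch_size, n_batches=500):
    """Standard deviation of per-batch means: how far the batch
    statistics wander from the true dataset mean (which is 0 here)."""
    means = [rng.choice(x, size=batch_size, replace=False).mean()
             for _ in range(n_batches)]
    return float(np.std(means))

spread_small = batch_mean_spread(8)    # tiny mini-batches
spread_large = batch_mean_spread(512)  # large mini-batches
# Per-batch means fluctuate roughly like 1/sqrt(batch_size),
# so small batches give much noisier normalization statistics.
```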

Thanks for the fast reply, TMosh. Your answer helps me in understanding input norm vs batch norm for the input variables.

However, I still do not understand how the first reason (rounder contours) is achieved by batch norm at any layer when input normalization has already been done.

If you already normalized the entire data set, there may be only marginal gains from using batch norm. The mean and standard deviation for each batch are computed from that specific batch of examples, and may differ slightly from those of the entire data set.

The “rounder contours” is another way of saying that the magnitudes of the features are all going to be very similar - this allows a higher learning rate to be used without risk of the solution diverging during training. So you can use a higher learning rate and fewer iterations, for example.
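The learning-rate point can be made concrete with a toy quadratic (my own sketch, not from the lecture): plain gradient descent on J(w) = 0.5 · wᵀHw converges only if the learning rate is below 2/λ_max(H), so elongated contours (one large eigenvalue) force a small learning rate, while round contours tolerate a larger one:

```python
import numpy as np

def gd_final_norm(H, lr, steps=50):
    """Run plain gradient descent on J(w) = 0.5 * w^T H w and
    return the final distance from the optimum w* = 0."""
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        w = w - lr * (H @ w)       # gradient of J is H @ w
    return float(np.linalg.norm(w))

round_H = np.diag([1.0, 1.0])       # round contours: equal eigenvalues
elongated_H = np.diag([1.0, 50.0])  # elongated contours: eigenvalues differ 50x

lr = 0.1  # safe for round_H (0.1 < 2/1) but unstable for elongated_H (0.1 > 2/50)
# gd_final_norm(round_H, lr)     -> close to 0 (converges)
# gd_final_norm(elongated_H, lr) -> astronomically large (diverges)
```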

@TMosh, do you mean “If you already normalized the entire data set, there may be only marginal gains from using batch norm for the first layer”? Or rather “If you already normalized the entire data set, there may be only marginal gains w.r.t. the first reason (rounder contours) from using batch norm at all layers”?

Since we want to optimize J(W^{[1]}, b^{[1]}, ..., W^{[L]}, b^{[L]}), “rounder contours” is another way of saying that the magnitudes of the gradients \nabla_{W^{[l]}, b^{[l]}} J (not of the features) are all going to be very similar, right? So does the first reason given for the effectiveness of batch norm at all layers (1st slide of “Normalizing Activations in a Network”) not hold if the inputs are normalized!?
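For what it’s worth, here is one way to connect “similar feature magnitudes” with “similar gradient magnitudes” in the simplest case, a linear model (my own sketch; X, y, m are the usual design matrix, targets, and number of examples):

```latex
% For linear regression, J(w) = \frac{1}{2m}\lVert Xw - y\rVert^2, so
\nabla_w J = \frac{1}{m} X^\top (Xw - y), \qquad
H = \nabla_w^2 J = \frac{1}{m} X^\top X .
% The contours of J are ellipses whose axes scale with the eigenvalues of H.
% If feature j has a much larger scale than feature k, then H_{jj} \gg H_{kk}:
% the contours are elongated AND \partial J / \partial w_j is typically much
% larger than \partial J / \partial w_k. Normalizing the features equalizes
% the diagonal of H, so "similar feature magnitudes", "similar gradient
% magnitudes", and "rounder contours" are three views of the same fact.
```

This only covers the input layer of a linear model; the open question above is whether the same argument carries over to the hidden-layer parameters of a deep network.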