Hello, I do not fully understand two aspects of the normalization that Prof. Ng explains in week 3; it would be great if somebody could explain.
a) We normalize the activations to avoid big variations in the output of the previous layer. This makes sense to me for activation functions like ReLU, where the output can become quite large, but why does it make sense for the sigmoid function? With the sigmoid the output is between 0 and 1 anyway, so there will not be major variations in the output data.
b) For the estimation of mu and sigma^2 when using mini-batches, Prof. Ng suggests using the exponentially weighted average. I know he stresses that the results are relatively robust to the exact estimation method, but why not use the simple average (mean) over the first n mini-batches? The mean is the unbiased estimator and should give the best result, right?
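Just to make sure I am comparing the right things, here is roughly what I mean by the two estimators, run on made-up per-batch activations (the toy data, the beta value and the variable names are only for illustration, not anything from the course):

```python
import numpy as np

np.random.seed(0)
# Made-up "activations": 200 mini-batches of 128 values with mean ~5 and variance ~4.
batches = [np.random.randn(128) * 2.0 + 5.0 for _ in range(200)]

# Estimator 1: exponentially weighted average of the per-batch statistics.
beta = 0.9
ema_mu, ema_var = 0.0, 0.0
for b in batches:
    ema_mu = beta * ema_mu + (1 - beta) * b.mean()
    ema_var = beta * ema_var + (1 - beta) * b.var()

# Estimator 2: plain average of the same per-batch statistics over the first n mini-batches.
n = 200
simple_mu = np.mean([b.mean() for b in batches[:n]])
simple_var = np.mean([b.var() for b in batches[:n]])

print(f"EMA:          mu = {ema_mu:.3f}, sigma^2 = {ema_var:.3f}")       # weighted toward recent batches
print(f"Simple mean:  mu = {simple_mu:.3f}, sigma^2 = {simple_var:.3f}")  # every batch counted equally
```

On stationary toy data like this the two land in essentially the same place, so my question is really whether the recency weighting of the first one ever buys anything.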
Heh, came here to ask the same question. I can imagine that it might actually be beneficial for more recently used values to be weighted more since the network was more recently trained on them. Might be an interesting thing to test with a little toy network.
Regarding a), I think the effect with the sigmoid is still the same. This is a bit hand-wavy, but a relatively small shift in an early layer can still be magnified and result in bigger and bigger changes in subsequent layers with sigmoids. For a layer that expects inputs in the range of -0.01 to 0.01, a shift to 0.5 to 0.8 would still be big. Normalizing might help even more here, since the sigmoid suffers from a vanishing gradient as it saturates, so the later layers will have a harder time adapting to the new mean and variance than they would have if they were ReLU.
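To make the saturation part a little less hand-wavy, here is a quick look at the sigmoid's gradient at a few input magnitudes (just toy numbers I picked, nothing from the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The local gradient of the sigmoid shrinks quickly once inputs drift away from 0.
for z in [0.0, 0.5, 2.0, 5.0]:
    print(f"z = {z:4.1f}   sigmoid(z) = {sigmoid(z):.3f}   gradient = {sigmoid_grad(z):.4f}")
# At z = 0.0 the gradient is 0.25; by z = 5.0 it is down to roughly 0.007,
# so a layer whose inputs have shifted into the saturated region learns much more slowly.
```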
Just some thoughts from a fellow student, take them with a big pinch of salt.
I am not sure about question b), as I can see no reason why more recently trained batches should have a “truer” mean/variance. Good advice to test it on a network.
I need to think more deeply about your answer to question a); my simple starting point was Prof. Ng’s recommendation in course 1 that (input) values do not need to be normalized if they are already relatively close to the range of -1 to +1. If I find the time, I will also try to test this in a network.
Maybe a slightly absurd example can help here.
Imagine a data set consisting of 2 million examples.
For the first million Y = 1 if x < 0.1, 0 otherwise.
For the second million Y = 1 if x < 0.9, 0 otherwise.
Now imagine training with mini-batches (with a size of 128 or so) without shuffling.
At first the decision boundary would move from wherever it was to 0.1.
After starting to train on mini-batches from the second million, it would slowly move toward 0.9.
In the next epoch, when you train with examples from the first million again, it would wander back to 0.1.
So in that contrived case more recent samples would more closely reflect what the model expects.
Now in practice we would of course shuffle our examples before dividing them into mini-batches, which would remove most of that bias.
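If anyone wants to try it, here is a rough, scaled-down sketch of that experiment: a tiny 1-D logistic regression instead of a real network, a few thousand examples instead of millions, and every number in it is just made up for illustration. Watching the prediction at x = 0.5 flip back and forth between the two halves shows the same wandering:

```python
import numpy as np

np.random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Scaled-down version of the contrived data set: two halves with different thresholds.
n_half = 128 * 150                                              # 19,200 examples per half instead of a million
x_a = np.random.rand(n_half); y_a = (x_a < 0.1).astype(float)   # first half:  y = 1 if x < 0.1
x_b = np.random.rand(n_half); y_b = (x_b < 0.9).astype(float)   # second half: y = 1 if x < 0.9
x = np.concatenate([x_a, x_b])
y = np.concatenate([y_a, y_b])

# Tiny 1-D logistic regression trained with plain mini-batch SGD, deliberately NOT shuffled.
w, b = -1.0, 0.0
lr, batch = 0.5, 128

for epoch in range(2):
    for start in range(0, len(x), batch):
        xb, yb = x[start:start + batch], y[start:start + batch]
        p = sigmoid(w * xb + b)
        w -= lr * np.mean((p - yb) * xb)             # gradient of the cross-entropy loss w.r.t. w
        b -= lr * np.mean(p - yb)                    # ... and w.r.t. b
        if start + batch in (n_half, 2 * n_half):    # peek at the model at the end of each half
            half = "first " if start + batch == n_half else "second"
            print(f"epoch {epoch}, after {half} half: p(y=1 | x=0.5) = {sigmoid(0.5 * w + b):.2f}")
```

Without shuffling, the prediction for x = 0.5 should swing toward 0 after each pass over the first half and toward 1 after each pass over the second half; with shuffling it should settle somewhere around 0.5 instead, which is the bias removal mentioned above.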
I guess in practice beta & gamma will start to settle with the rest of the parameters as the error goes down. In that case the recency bias of the EMA will also start to matter less and less.