Hello, I do not fully understand two aspects of the normalization that Prof. Ng explains in week 3; it would be great if somebody could explain.
a) We normalize the activations to avoid big variations in the output of the previous layer. This makes sense to me for activation functions like ReLU, where the output can become quite large, but why does it make sense for the sigmoid function? With the sigmoid the output is between 0 and 1 anyway, so there will not be major variations in the output data.
b) For the estimation of mu and sigma^2 when using mini-batches, Prof. Ng suggests using the exponentially weighted average. I know he stresses that the results are relatively robust to the exact estimation method, but why not use the simple average (mean) over the first n mini-batches? The mean is the unbiased estimator and should give the best result, right?
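Just to make sure I am comparing the right things, here is roughly what I mean by the two estimators, run on made-up per-batch activations (the toy data, the beta value and the variable names are only for illustration, not anything from the course):

```python
import numpy as np

np.random.seed(0)
# Made-up "activations": 200 mini-batches of 128 values with mean ~5 and variance ~4.
batches = [np.random.randn(128) * 2.0 + 5.0 for _ in range(200)]

# Estimator 1: exponentially weighted average of the per-batch statistics.
beta = 0.9
ema_mu, ema_var = 0.0, 0.0
for b in batches:
    ema_mu = beta * ema_mu + (1 - beta) * b.mean()
    ema_var = beta * ema_var + (1 - beta) * b.var()

# Estimator 2: plain average of the same per-batch statistics over the first n mini-batches.
n = 200
simple_mu = np.mean([b.mean() for b in batches[:n]])
simple_var = np.mean([b.var() for b in batches[:n]])

print(f"EMA:          mu = {ema_mu:.3f}, sigma^2 = {ema_var:.3f}")       # weighted toward recent batches
print(f"Simple mean:  mu = {simple_mu:.3f}, sigma^2 = {simple_var:.3f}")  # every batch counted equally
```

On stationary toy data like this the two land in essentially the same place, so my question is really whether the recency weighting of the first one ever buys anything.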
Heh, came here to ask the same question. I can imagine that it might actually be beneficial for more recently used values to be weighted more since the network was more recently trained on them. Might be an interesting thing to test with a little toy network.
Regarding a), I think the effect with the sigmoid is still the same. This is a bit hand-wavy, but a relatively small shift in an early layer can still be magnified and result in bigger and bigger changes in subsequent layers with sigmoids. For a layer that expects inputs in the range of -0.01 to 0.01, a shift to 0.5 to 0.8 would still be big. Normalizing might help even more here, since the sigmoid suffers from a vanishing gradient as it saturates, so the later layers will have a harder time adapting to the new mean and variance than they would have if they were ReLU.
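To make the saturation part a little less hand-wavy, here is a quick look at the sigmoid's gradient at a few input magnitudes (just toy numbers I picked, nothing from the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The local gradient of the sigmoid shrinks quickly once inputs drift away from 0.
for z in [0.0, 0.5, 2.0, 5.0]:
    print(f"z = {z:4.1f}   sigmoid(z) = {sigmoid(z):.3f}   gradient = {sigmoid_grad(z):.4f}")
# At z = 0.0 the gradient is 0.25; by z = 5.0 it is down to roughly 0.007,
# so a layer whose inputs have shifted into the saturated region learns much more slowly.
```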
Just some thoughts from a fellow student, take them with a big pinch of salt.
I am not sure about question b), as I can see no reason why more recently trained batches should have a “truer” mean/variance. Good advice to test it on a network.
I need to think more deeply about your answer to question a); my simple starting point was Prof. Ng’s recommendation in course 1 that (input) values do not need to be normalized if they are already relatively close to the range of -1 to +1. If I find the time, I will also try to test this in a network.
Maybe a slightly absurd example can help here.
Imagine a data set consisting of 2 million examples.
For the first million Y = 1 if x < 0.1, 0 otherwise.
For the second million Y = 1 if x < 0.9, 0 otherwise.
Now imagine training with mini-batches (with a size of 128 or so) without shuffling.
At first the decision boundary would move from wherever it was to 0.1.
After starting to train on mini-batches from the second million, it would slowly move toward 0.9.
In the next epoch, when you train with examples from the first million again, it would wander back to 0.1.
So in that contrived case more recent samples would more closely reflect what the model expects.
Now in practice we would of course shuffle our examples before dividing them into mini-batches, which would remove most of that bias.
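If anyone wants to try it, here is a rough, scaled-down sketch of that experiment: a tiny 1-D logistic regression instead of a real network, a few thousand examples instead of millions, and every number in it is just made up for illustration. Watching the prediction at x = 0.5 flip back and forth between the two halves shows the same wandering:

```python
import numpy as np

np.random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Scaled-down version of the contrived data set: two halves with different thresholds.
n_half = 128 * 150                                              # 19,200 examples per half instead of a million
x_a = np.random.rand(n_half); y_a = (x_a < 0.1).astype(float)   # first half:  y = 1 if x < 0.1
x_b = np.random.rand(n_half); y_b = (x_b < 0.9).astype(float)   # second half: y = 1 if x < 0.9
x = np.concatenate([x_a, x_b])
y = np.concatenate([y_a, y_b])

# Tiny 1-D logistic regression trained with plain mini-batch SGD, deliberately NOT shuffled.
w, b = -1.0, 0.0
lr, batch = 0.5, 128

for epoch in range(2):
    for start in range(0, len(x), batch):
        xb, yb = x[start:start + batch], y[start:start + batch]
        p = sigmoid(w * xb + b)
        w -= lr * np.mean((p - yb) * xb)             # gradient of the cross-entropy loss w.r.t. w
        b -= lr * np.mean(p - yb)                    # ... and w.r.t. b
        if start + batch in (n_half, 2 * n_half):    # peek at the model at the end of each half
            half = "first " if start + batch == n_half else "second"
            print(f"epoch {epoch}, after {half} half: p(y=1 | x=0.5) = {sigmoid(0.5 * w + b):.2f}")
```

Without shuffling, the prediction for x = 0.5 should swing toward 0 after each pass over the first half and toward 1 after each pass over the second half; with shuffling it should settle somewhere around 0.5 instead, which is the bias removal mentioned above.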
I guess in practice beta & gamma will start to settle with the rest of the parameters as the error goes down. In that case the recency bias of the EMA will also start to matter less and less.