Optimization methods vs normalizing input features

As I understood it from the previous week: normalizing your input results in a cost function that’s more homogeneous / less skewed, which in turn should result in gradients pointing more directly toward the minimum.

How does normalizing the input compare to all the optimization methods in this week? Are we just trying to speed up training for badly normalized data? Or would these optimization techniques still improve training speed with properly normalized input data?

Cheers,

Nikolaj

Yes! The point is you always do normalization. Why wouldn’t you? It’s inexpensive, it frequently helps, and at worst it does nothing. But then you have other potential sources of randomness in the parameter updates, caused by (e.g.) the choice of minibatch size. The smaller the minibatch size, the more stochastic the behavior of the updates, so further smoothing techniques like RMSprop and momentum can still help. But every case may be different: there is no “one size fits all” magic combo of hyperparameters that always works. Otherwise we wouldn’t need to worry about understanding most of the things Prof Ng is showing us here in Weeks 1 and 2 of Course 2. :nerd_face:
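To make that concrete, here’s a minimal NumPy sketch of the two ideas side by side: z-score input normalization (done once, before training), plus the momentum and RMSprop update rules from this week’s lectures. The variable names and toy data are my own, not from the course notebooks:

```python
import numpy as np

# --- Input normalization (done once, before training) ---
# Standardize each feature to mean 0 and variance 1.
X = np.random.rand(3, 1000) * 100     # toy data: 3 features, 1000 examples
mu = X.mean(axis=1, keepdims=True)
sigma = X.std(axis=1, keepdims=True)
X_norm = (X - mu) / sigma             # reuse the same mu/sigma on test data

# --- Smoothing the updates (done every minibatch) ---
beta1, beta2, eps, lr = 0.9, 0.999, 1e-8, 0.01
W = np.random.randn(4, 3)
vdW = np.zeros_like(W)                # momentum: moving average of gradients
sdW = np.zeros_like(W)                # RMSprop: moving average of squared gradients

dW = np.random.randn(*W.shape)        # stand-in for one noisy minibatch gradient

# Momentum update: step in the direction of the smoothed gradient
vdW = beta1 * vdW + (1 - beta1) * dW
W_momentum = W - lr * vdW

# RMSprop update: divide by the RMS of recent gradients
sdW = beta2 * sdW + (1 - beta2) * dW**2
W_rmsprop = W - lr * dW / (np.sqrt(sdW) + eps)
```

Both rules keep an exponentially weighted average across minibatches, which is exactly the “further smoothing” that helps when small minibatches make individual gradients noisy.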

Maybe the other point worth mentioning is that the pictures shown above are a bit too simplistic, for the sake of illustrating the point Prof Ng is making there. In “real life” the solution surfaces have shapes a lot more complex than that: radically non-convex. Here’s a paper from Yann LeCun’s group which discusses these issues and provides a contrast to the lovely convex shapes shown above. :thinking:


Actually, this paper shows more visual representations of loss surfaces.

Just my 2 cents: there are some rare cases where normalization of the input can hurt. These are cases where the input is more or less flat with subtle variations: say the range of possible values for a field is 0 to 100, but the actual values vary only between 98.8 and 99 (and the floating-point numbers in between). One example of this could be that the data comes from a sensor, and that variation is just measurement noise.

If we apply normalization, it will immediately magnify that noise to span the range 0 to 1 (or to mean 0 and standard deviation 1, depending on how we normalized), putting those values on par with all the other features, which will have a similar range post-normalization.

This can give the feature unwarranted importance or slow down convergence. I agree with Paul that we absolutely must normalize before feeding data into a neural network, but it can help to check the distribution of the input measurements to ensure we are not in this scenario. If we find an input measurement with very little variation (but not a flat line), we should double-check whether that variation is significant, valid behavior or just noise.
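Here’s a quick sketch of that scenario (the 98.8-to-99 sensor range is from the example above; the code itself is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# A feature whose legal range is 0-100, but whose actual values are
# just sensor noise around 98.9 (i.e. between 98.8 and 99.0).
noisy_feature = 98.9 + rng.uniform(-0.1, 0.1, size=1000)
print(noisy_feature.std())            # tiny: roughly 0.058

# Standardizing stretches that noise to unit variance, making it look
# as "important" as any genuinely informative feature.
standardized = (noisy_feature - noisy_feature.mean()) / noisy_feature.std()
print(standardized.std())             # exactly 1.0
print(standardized.min(), standardized.max())  # roughly -1.7 to 1.7
```

Checking the raw standard deviation (or plotting a histogram) before normalizing is enough to catch this case.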

Hi,

I was revisiting this part again for refreshing my memory.
Reading your comment it triggers a followup question:

You mentioned that little variation in an input might just be noise that gets ‘amplified’ by normalization. But who’s to say that the activations of the previous layer don’t also have only small, noise-induced variation? Batch normalization would amplify that noise as well, and we would have no idea, since we don’t understand what the activations actually mean. Is this a ‘known’ issue? Or am I missing something obvious here?

Cheers,
Nikolaj

Hi Nikolaj,

That’s a good question. My understanding is that there are two things at play here when it comes to batch normalization:

  1. The problem of noise being amplified is not a very common phenomenon, and it is almost always a good idea to normalize the data, as Paul mentioned (though we should be aware of this potential pitfall). Likewise, the probability of this occurring during batch normalization is also very low.

  2. If you refer to Andrew Sir’s lecture on Batch Normalization, the data is normalized per minibatch, which also acts as a kind of regularizer (an unintended side effect). In some batches you may see noise getting magnified, but in most you won’t. Also, the network keeps learning the batch normalization parameters (the scale and shift, sketched below) during training to optimize the metric at hand.

So, while the same problem exists in principle, it’s not big enough to outweigh the benefits of batch normalization.
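For reference, here’s a minimal sketch of a batch-norm forward pass with the learned gamma and beta mentioned in point 2 (my own toy code, not the course’s implementation):

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize Z per minibatch, then rescale with learned gamma/beta.

    Because gamma and beta are trained, the network can undo (or soften)
    the normalization for any unit where unit variance is harmful,
    e.g. where it would just be amplifying noise.
    """
    mu = Z.mean(axis=1, keepdims=True)    # per-unit minibatch mean
    var = Z.var(axis=1, keepdims=True)    # per-unit minibatch variance
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta          # learned scale and shift

# Toy usage: 5 hidden units, minibatch of 64 examples
Z = np.random.randn(5, 64) * 3 + 2
gamma = np.ones((5, 1))
beta = np.zeros((5, 1))
Z_tilde = batchnorm_forward(Z, gamma, beta)
```

The key point for your question is the last line: since gamma and beta are trainable, a unit whose normalized activations are pure amplified noise can have its gamma driven toward zero during training, damping that noise back down.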