I have a question about the feature scaling example provided in the feature scaling lab. It is mentioned that feature scaling allows the algorithm to converge faster.
However, in the example provided, the number of iterations after feature scaling is 100,000 compared to 10,000 without feature scaling, and the learning rate has also been increased.
To check only the effect of feature scaling, I kept both the number of iterations and the learning rate constant (i.e. the same values used before feature scaling).
As shown below, the algorithm after feature scaling isn't even close to converging with the same values at which the algorithm without feature scaling converged.
Can someone please explain why this is happening and how feature scaling helps convergence?
The benefit of feature normalization is not that it speeds up convergence given the same learning rate and number of iterations.
The benefit is that normalization allows you to use a larger learning rate without the risk of the solution diverging, which in turn lets you use fewer iterations.
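Here is a minimal sketch of that point (synthetic data and a bare-bones gradient descent loop I wrote for this post, not the lab's code): with the same learning rate, gradient descent diverges on the raw features but converges once they are z-score normalized.

```python
import numpy as np

# Synthetic data (not the lab's data): one large-scale and one small-scale feature.
rng = np.random.default_rng(0)
m = 100
X = np.column_stack([rng.uniform(500, 3500, m),   # e.g. size in sq. ft.
                     rng.uniform(1, 5, m)])       # e.g. number of bedrooms
y = 0.1 * X[:, 0] + 10.0 * X[:, 1] + rng.normal(0, 1, m)

def run_gd(X, y, alpha, iters):
    """Plain batch gradient descent for linear regression; returns the final cost."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        err = X @ w + b - y
        w -= alpha * (X.T @ err) / len(y)
        b -= alpha * err.mean()
    return ((X @ w + b - y) ** 2).mean() / 2

alpha, iters = 0.1, 1000

# Raw features: alpha = 0.1 is far too large for the large-scale feature,
# so the cost blows up to inf/nan (NumPy will warn about overflow).
print("raw       :", run_gd(X, y, alpha, iters))

# Z-score normalized features: the very same alpha converges quickly.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
print("normalized:", run_gd(X_norm, y, alpha, iters))
```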
Our primary goal is to converge to the optimal cost as quickly as possible.
In principle, there is no limit on the value we can set for the learning rate. In practice, however, divergence becomes the limiting factor when we try to set very high learning rates.
As @TMosh has explained, normalizing the features gives us more leeway to set a higher learning rate without the risk of divergence.
So it is not the normalization of the features in itself that speeds up convergence. Rather, normalization relaxes the upper limit on the learning rate, and we take advantage of this by setting a higher learning rate, which in turn gives faster convergence.
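To make that concrete, here is a rough sketch (again with made-up synthetic data, and the specific learning rates and iteration counts are only illustrative, not the lab's values): on the raw features only a tiny learning rate is stable, so even a huge number of iterations makes slow progress, while the normalized features reach a lower cost in a few hundred iterations with a much larger learning rate.

```python
import numpy as np

# Same kind of synthetic data as in the sketch above.
rng = np.random.default_rng(1)
m = 100
X = np.column_stack([rng.uniform(500, 3500, m), rng.uniform(1, 5, m)])
y = 0.1 * X[:, 0] + 10.0 * X[:, 1] + rng.normal(0, 1, m)

def final_cost(X, y, alpha, iters):
    """Run batch gradient descent and return the final squared-error cost."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        err = X @ w + b - y
        w -= alpha * (X.T @ err) / len(y)
        b -= alpha * err.mean()
    return ((X @ w + b - y) ** 2).mean() / 2

# Raw features: only a tiny learning rate avoids divergence, and even
# 100,000 iterations at that rate are not enough to converge fully.
print("raw,  alpha=1e-7, 100000 iters:", final_cost(X, y, 1e-7, 100_000))

# Normalized features: a learning rate of 0.1 is stable, and a few
# hundred iterations reach a lower cost than the run above.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
print("norm, alpha=0.1,     500 iters:", final_cost(X_norm, y, 1e-1, 500))
```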