Is my understanding of Feature Scaling correct?

I am going through the ‘Optional Lab: Feature scaling and learning rate’ and want to make sure I understand what effect feature scaling has on the model.

In simple terms, does it allow us to use a greater learning rate, alpha, when the features are scaled? In the lab example, the house size is in the thousands, which limits the learning rate we can use: the derivative of the cost with respect to that feature comes out very large, so a big alpha would cause J to diverge.

If we have scaled all of our features, then we can use a greater learning rate, which means we converge faster, without the risk of a runaway cost J (divergence).


Feature scaling is mainly used so that the weights are updated evenly during gradient descent, which means fewer iterations are needed to converge to the minimum.

I’m not certain that, as a rule, we can say that because we have scaled the features we can increase the learning rate alpha.

From the Details section of the lab we see what happens when the features are not scaled: one weight converges quickly to the minimum while the others take many more iterations.
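A quick numeric sketch of both effects, using made-up numbers rather than the lab’s data: with the raw size feature in the thousands, even a small alpha makes J explode, while after z-score normalization a learning rate 100x larger converges without trouble.

```python
import numpy as np

# Hypothetical data: size in sqft (order of thousands) and age in years.
# These values are illustrative, not taken from the lab.
X = np.array([[2104.0, 45.0],
              [1416.0, 40.0],
              [1534.0, 30.0],
              [852.0,  36.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])

def run_gd(X, y, alpha, iters):
    """Batch gradient descent for linear regression; returns the final cost J."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(iters):
        err = X @ w + b - y                 # prediction error
        w -= alpha * (X.T @ err) / m        # gradient is scaled by the feature values
        b -= alpha * err.sum() / m
    return ((X @ w + b - y) ** 2).mean() / 2

# Unscaled: the size column makes dJ/dw for that weight enormous, so J explodes.
print(run_gd(X, y, alpha=1e-3, iters=10))          # astronomically large

# Z-score normalize, and a much larger alpha converges fine.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
print(run_gd(X_norm, y, alpha=1e-1, iters=1000))   # small cost
```

The unscaled run is not wrong in principle: a tiny alpha (around 1e-7 here) would still converge, just very slowly for the other weight, which is the “many more iterations” effect described above.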

Thanks. Is there any downside to feature scaling, such as losing information or altering the distribution in a way that affects the results?

There is no real downside, other than that you have to apply the same normalization (using the mean and standard deviation computed from the training set) to any new inputs you want to make predictions on.
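As a sketch (again with made-up numbers): compute the normalization statistics from the training set once, keep them, and reuse those exact values on every new input.

```python
import numpy as np

# Hypothetical training features: size in sqft, age in years.
X_train = np.array([[2104.0, 45.0],
                    [1416.0, 40.0],
                    [852.0,  36.0]])

# Fit the normalization on the TRAINING data only, and store the statistics.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
X_train_norm = (X_train - mu) / sigma     # what the model is trained on

# At prediction time, a new house must go through the SAME transform,
# using the stored training mu and sigma -- never statistics of the new data.
x_new = np.array([1200.0, 40.0])
x_new_norm = (x_new - mu) / sigma
print(x_new_norm)
```

Scaling is an invertible linear shift-and-stretch per feature, so no information is lost; the model simply learns weights in the scaled units.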