Why does feature scaling allow a larger learning rate?

Hi, I am doing Lab 3 in Week 2.
In Lab 2, without feature scaling, a learning rate of 1e-6 is already too large and makes training diverge (the cost increases during gradient descent).

After feature scaling, the model converges even with a much larger learning rate, like 1e-2.

It’s quite amazing! Can anyone explain the reason behind it?

I love this slide. It is from the C1 W2 video “Feature scaling, part 1”, at 6:12.

So the problem with unnormalized features is that your update is more susceptible to overshooting. In the top-right plot, the trouble lies with w_1 (the weight for the unnormalized feature): it is always the horizontal component of the update arrow that has to go back and forth across the optimum.

That’s why you need to choose a small learning rate, so that the horizontal component doesn’t overshoot (it doesn’t jump past the optimal w_1 on each update). However, the smaller the learning rate, the slower the update for w_2 as well, because we have one learning rate for every weight. Look at the top-right plot again: a smaller learning rate also makes progress in the w_2 direction very slow.

The perfect scenario is for both weights to reach their optimal values at the same time, which is why we love the bottom-right version.
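Here is a minimal, runnable sketch of that effect (made-up data and my own variable names, not the lab code): the identical gradient-descent loop blows up on the raw features at a learning rate that works fine once the features are z-score normalized.

```python
import numpy as np

# Made-up data: feature 0 spans roughly 500-2000 (e.g. size), feature 1 spans 1-5 (e.g. bedrooms).
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(500, 2000, 100), rng.uniform(1, 5, 100)])
y = 0.2 * X[:, 0] + 10 * X[:, 1] + rng.normal(0, 10, 100)

def gradient_descent(X, y, alpha, iters=1000):
    """Plain batch gradient descent for linear regression with weights w and bias b."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(iters):
        err = X @ w + b - y           # prediction error per example
        w -= alpha * (X.T @ err) / m  # gradient step for the weights
        b -= alpha * err.mean()       # gradient step for the bias
    final_err = X @ w + b - y
    return w, b, (final_err ** 2).mean() / 2

# Raw features: this learning rate is far too large, so the cost blows up (overflow -> inf/nan).
print(gradient_descent(X, y, alpha=1e-2)[2])

# Z-score normalized features: the same learning rate converges.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
print(gradient_descent(X_norm, y, alpha=1e-2)[2])
```

On the raw features, the large-range column dominates the gradient, so its weight overshoots on every step and the cost grows without bound; after normalization, both gradients have similar magnitudes and the same learning rate converges.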

Raymond


Also, if an update overshoots, it may either diverge or still converge; if it never overshoots, it should converge.
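A tiny 1-D example (my own, not from the course) shows both cases. For J(w) = w^2 the gradient is 2*w, so the update is w <- w - alpha*2*w = (1 - 2*alpha)*w: the step overshoots the minimum whenever alpha > 0.5, yet it still converges as long as |1 - 2*alpha| < 1, i.e. alpha < 1.

```python
# 1-D example: J(w) = w**2, gradient = 2*w, so each update multiplies w by (1 - 2*alpha).
def run(alpha, w=10.0, iters=20):
    for _ in range(iters):
        w = w - alpha * 2 * w
    return w

print(run(0.4))  # no overshoot -> converges toward 0
print(run(0.7))  # overshoots   -> still converges, oscillating with shrinking steps
print(run(1.2))  # overshoots   -> diverges, the oscillation grows every step
```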

Thank you so much, @rmwkwok! That helps a lot.

@kkkk, you are welcome!


Hi @rmwkwok,

What is the best option?

  • x/max
  • Mean normalization
  • Z-score normalization

Hi @cajumago,

First, I think min-max normalization is the more general form of “x/max”. Second, there is usually no single best option. As far as the goal of keeping all features within similar ranges of values is concerned, all three of them are equally good for MLS.
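For reference, here is roughly what these options look like on a made-up feature column in numpy (a sketch, not the lab’s implementation; I have added the min-max form mentioned above for comparison):

```python
import numpy as np

x = np.array([500.0, 800.0, 1200.0, 2000.0])     # one made-up feature column

x_by_max  = x / x.max()                          # divide by max: values end up in (0, 1]
x_minmax  = (x - x.min()) / (x.max() - x.min())  # min-max: values end up in [0, 1]
x_mean    = (x - x.mean()) / (x.max() - x.min()) # mean normalization: centered at 0, roughly in [-1, 1]
x_zscore  = (x - x.mean()) / x.std()             # z-score: mean 0, standard deviation 1

print(x_by_max, x_minmax, x_mean, x_zscore, sep="\n")
```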

Cheers,
Raymond


This is very much needed; I’ve been scratching my head trying to figure it out.
THANK YOU very much!


You are welcome, David 🙂