Why does feature scaling allow a larger learning rate?

Hi, I am doing lab 3 in week 2.
In lab 2, without feature scaling, even a learning rate of 1e-6 is too large and makes the training diverge (the cost increases during gradient descent).

After feature scaling, the model converges even with a much larger learning rate like 1e-2.

It’s quite amazing! Can anyone explain the reason behind it?

I love this slide. It is in the C1 W2 video for Feature scaling part 1 at time 6:12

So the problem with unnormalized features is that your update is more susceptible to overshooting. In the top-right plot, the problem lies with the weight w_1: it is always the horizontal component of the update arrow that keeps going back and forth past the optimum.

That’s why you need to choose a small learning rate, so that the horizontal component won’t overshoot (it won’t pass beyond the optimal w_1 on each update). However, the smaller the learning rate, the slower the update for w_2 as well, because we have one learning rate for everyone. If you look at the top-right plot again, a smaller learning rate makes the update in the w_2 direction very slow too.

The perfect scenario is for both weights to reach their optimal values at the same time, which is why we love the bottom-right version.



Also, if the update overshoots, it can either diverge or still converge. If it doesn’t overshoot, it should converge.
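To see this effect concretely, here is a minimal sketch (not from the lab; the data, feature ranges, and step counts are all made up for illustration). One feature is roughly 1000x larger than the other, so the same learning rate of 1e-2 blows up on the raw features but converges after z-score scaling:

```python
import numpy as np

def gradient_descent(X, y, lr, steps=2000):
    """Plain batch gradient descent on the MSE cost.
    No bias term; y is assumed to be centered."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)  # gradient of (1/2m)*||Xw - y||^2
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
# Made-up data: feature ranges differ by ~1000x
x1 = rng.uniform(500, 2000, 100)   # e.g. size in square feet
x2 = rng.uniform(1, 5, 100)        # e.g. number of bedrooms
X = np.column_stack([x1, x2])
y = 0.1 * x1 + 10 * x2
y = y - y.mean()                   # center the target so no bias term is needed

# Unscaled: lr = 1e-2 is far too large for the w_1 direction -> divergence
with np.errstate(all="ignore"):    # silence the inevitable overflow warnings
    w_bad = gradient_descent(X - X.mean(axis=0), y, lr=1e-2)
print(np.isfinite(w_bad).all())    # False: the weights blew up to inf/nan

# Z-score scaled: the very same lr = 1e-2 now converges
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
w_good = gradient_descent(Xs, y, lr=1e-2)
cost = np.mean((Xs @ w_good - y) ** 2) / 2
print(cost < 1e-6)                 # True: essentially at the optimum
```

The reason is the one described above: the curvature of the cost in the w_1 direction scales with the square of that feature's magnitude, so a step size safe for w_1 on raw data is painfully small for w_2; scaling evens out the curvature so one learning rate serves both weights.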

Thank you so much! @rmwkwok That helps a lot.

@kkkk, you are welcome!


Hi @rmwkwok,

What is the best option?

  • x/max
  • Mean normalization
  • Z-score normalization

Hi @cajumago,

First, I think min-max normalization is the more general form of “x/max”. Second, there is usually no single best option. As far as the purpose of keeping all features in similar ranges of values is concerned, all three of them are equally good for MLS.
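For reference, the three options can be sketched in a few lines of NumPy (the numbers here are made up; note that NumPy's `std` is the population standard deviation by default):

```python
import numpy as np

x = np.array([500.0, 800.0, 1500.0, 2000.0])  # made-up feature values

# 1. Divide by the maximum: values end up in (0, 1]
x_by_max = x / x.max()

# 2. Mean normalization: (x - mean) / (max - min), roughly within [-1, 1]
x_mean_norm = (x - x.mean()) / (x.max() - x.min())

# 3. Z-score normalization: result has mean 0 and standard deviation 1
x_zscore = (x - x.mean()) / x.std()
```

All three map the feature into a similar small range, which is what matters for gradient descent.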



This is very much needed; I’ve been scratching my head trying to figure it out.
THANK YOU very much


You are welcome, David 🙂