Why can feature scaling make the learning rate larger?

I love this slide. It is from the C1 W2 video "Feature scaling part 1", at 6:12.

So the problem with unnormalized features is that the update is more susceptible to overshooting. In the top right plot, the trouble is with w_1: because its feature has a much larger range, the partial derivative with respect to w_1 is much larger, so it is always the horizontal component of the update arrow that has to bounce back and forth across the minimum.

That's why you need to choose a small learning rate, so that the horizontal component won't overshoot (won't jump past the optimal w_1 on each update). However, the smaller the learning rate, the slower the update for w_2 as well, because the same learning rate is shared by all the parameters. If you look at the top right plot again, a smaller learning rate makes progress in the w_2 direction very slow.

The perfect scenario is for both weights to reach their optimal values at about the same time, which is why we love the bottom right version.
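To make this concrete, here is a minimal sketch in plain numpy. The data, feature names, and learning rates below are my own made-up toy values, not the slide's, but they reproduce the same effect: with an unscaled large-range feature, the shared learning rate must stay tiny or w_1 overshoots, which leaves w_2 crawling; after z-score scaling, one much larger learning rate moves both weights along nicely.

```python
import numpy as np

# Toy data (made up for illustration): x1 has a large range, x2 a small one.
np.random.seed(0)
x1 = np.random.uniform(300, 2000, size=50)   # e.g. size in square feet
x2 = np.random.uniform(0, 5, size=50)        # e.g. number of bedrooms
X = np.column_stack([x1, x2])
y = 0.1 * x1 + 4.0 * x2                      # target built from known weights

def gradient_descent(X, y, alpha, steps=200):
    """Plain batch gradient descent on the mean squared error cost."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        err = X @ w - y
        w -= alpha * (X.T @ err) / len(y)    # one shared alpha for every w_j
    return w

# Unscaled: alpha has to be tiny so w_1 doesn't overshoot, and then w_2 crawls.
print(gradient_descent(X, y, alpha=1e-6))            # w_2 still far from 4.0
# A slightly larger alpha makes w_1 bounce past the minimum and diverge.
print(gradient_descent(X, y, alpha=1e-5, steps=10))  # already huge after 10 updates

# After z-score scaling, both directions have similar curvature, so a much
# larger alpha works and both weights converge quickly (note they are now in
# "per standard deviation of the feature" units).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(gradient_descent(X_scaled, y, alpha=0.1))
```

The exact numbers don't matter; the point is the gap of several orders of magnitude between the usable learning rates before and after scaling, which comes straight from the gap between the feature ranges.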

Raymond
