Hi, I am doing Lab 3 in Week 2.
In Lab 2, without feature scaling, a learning rate of 1e-6 is already too large and makes training diverge (the cost keeps increasing during gradient descent).
After feature scaling, the model can converge with a very large learning rate like 1e-2.
It’s quite amazing! Can anyone explain the reason behind it?
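For reference, here is roughly what I'm seeing, as a minimal sketch with made-up synthetic numbers (this is not the actual lab code or data, just two features with very different ranges):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
size  = rng.uniform(500, 3500, m)              # feature 1: large range (made-up "sqft")
rooms = rng.integers(1, 6, m).astype(float)    # feature 2: small range
X = np.column_stack([size, rooms])
y = 100.0 * size + 5000.0 * rooms + rng.normal(0, 1000, m)

def run_gd(X, y, alpha, iters=200):
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(iters):
        err = X @ w + b - y                    # prediction error
        w -= alpha * (X.T @ err) / m           # gradient step on the weights
        b -= alpha * err.mean()                # gradient step on the bias
    return ((X @ w + b - y) ** 2).mean() / 2   # final cost

print(run_gd(X, y, alpha=1e-6))                # huge / NaN: the cost blows up (diverges)

X_norm = (X - X.mean(axis=0)) / X.std(axis=0)  # z-score scaling
print(run_gd(X_norm, y, alpha=1e-1))           # finite cost near the noise floor: converges
```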
The problem with unnormalized features is that the update is more prone to overshooting, and in the top-right plot the trouble comes from w_1: it is always the horizontal component of the update arrow that has to bounce back and forth across the minimum.
That’s why you need a small learning rate, so the horizontal component won’t overshoot (won’t jump past the optimal w_1 on each update). However, the smaller the learning rate, the slower the update for w_2 as well, because we use one learning rate for all the weights. Looking at the top-right plot again, a learning rate small enough to keep w_1 from overshooting makes progress in the w_2 direction very slow.
The ideal scenario is for both weights to reach their optimal values at roughly the same time, which is why we love the bottom-right version.
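To put a number on the “one learning rate for everyone” point, here is a small sketch (same kind of made-up synthetic data as above, not the lab code). The gradient with respect to w_j is the mean of the error times x_j, so the weight attached to the large-range feature gets a far bigger gradient, and a single alpha that is safe for w_1 barely moves w_2:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
X = np.column_stack([rng.uniform(500, 3500, m),             # x_1: large range
                     rng.integers(1, 6, m).astype(float)])  # x_2: small range
y = 100.0 * X[:, 0] + 5000.0 * X[:, 1]

w, b = np.zeros(2), 0.0
err = X @ w + b - y
grad_w = (X.T @ err) / m        # dJ/dw_j = mean(err * x_j): scales with the feature
print(grad_w)                   # |dJ/dw_1| is hundreds of times larger than |dJ/dw_2|

alpha = 5e-7                    # small enough that w_1 does not overshoot...
print(alpha * grad_w)           # ...so the step taken in the w_2 direction is tiny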
First, min-max normalization is the more general form of “x/max”. Second, there is usually no single best option. As far as the goal of keeping all features in similar ranges of values is concerned, all three of them work equally well for MLS.
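To see the “similar ranges” point concretely, here is a quick sketch (assuming the three options being compared are divide-by-max, min-max, and z-score; the column values are made up):

```python
import numpy as np

x = np.array([500.0, 1200.0, 2000.0, 3500.0])       # one made-up feature column

x_max    = x / x.max()                               # divide by max: range (0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())       # min-max: range [0, 1]
x_zscore = (x - x.mean()) / x.std()                  # z-score: mean 0, std 1

for name, v in [("x/max", x_max), ("min-max", x_minmax), ("z-score", x_zscore)]:
    print(f"{name:8s} min={v.min():6.2f}  max={v.max():6.2f}")
```

All three land the feature in a small range around zero or one, which is what matters for gradient descent.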