The relation between scaling and learning rate

Hi @Mahmoud_Mohamed4

You might have watched the lecture for “Feature scaling part 1” in Course 1 Week 2, but sometimes re-watching a lecture at a different time gives a learner a new angle and helps a working understanding click. Here is a slide from that lecture that is most relevant to your question; in particular, the red arrows depict how unscaled features make it difficult for gradient descent to converge, whereas scaled features provide a much more “direct path” to the optimal solution.

As Tom has also explained, with unscaled features we need to pick the learning rate very carefully, making it small enough that the updates do not diverge along the direction of the weight with the narrow acceptable range (w_1, the weight for size in feet²). For example, if we look at the upper-right graph along the w_1 direction, each “walking” step needs to be around 0 ~ 0.2 for the updates not to diverge. The step size is controlled by the learning rate, which has to be small enough not to push the step out of that acceptable range.

Such a small learning rate, however, does not suit the w_2 direction (# bedrooms), which spans a larger range of 0 ~ 100. A reasonable step for w_2 is likely around 0 ~ 20, which is 100 times larger than the acceptable range for w_1. Therefore, under the limitation that both directions share the same learning rate, a small learning rate gives a reasonable step size along w_1 but is far too small for w_2, and so it takes “more time” (more steps) for w_2 to converge.
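To make the divergence side of this concrete, here is a minimal sketch (the house numbers and the learning rate are made up for illustration, not taken from the lecture): with unscaled features, a learning rate that would be a reasonable step for the small-magnitude feature blows up the weight of the large-magnitude one, because the gradient along each weight scales with that feature's magnitude.

```python
import numpy as np

# Hypothetical toy data: one feature on a ~1000 scale, one on a ~1 scale.
size = np.array([1000.0, 1500.0, 2000.0])  # size in feet^2
beds = np.array([2.0, 3.0, 4.0])           # number of bedrooms
y = 0.1 * size + 20 * beds                 # a linear target

w = np.zeros(2)
alpha = 1e-3  # tolerable for the beds direction, far too big for size

# A few steps of batch gradient descent on mean squared error.
for _ in range(5):
    pred = w[0] * size + w[1] * beds
    grad = 2 * np.array([(pred - y) @ size, (pred - y) @ beds]) / len(y)
    w -= alpha * grad

print(w)  # w[0] (the size weight) has exploded instead of converging
```

Shrinking `alpha` until the size direction stops exploding is exactly the “pick a small enough learning rate” constraint described above, and that same shrunken `alpha` is then what starves the other direction.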

If we then look at the bottom-right graph, where both features are scaled to the same range, both directions now accept a similar step size, so one direction does not need to walk slower to “accommodate” the other.
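The whole argument can also be seen numerically. Below is a small sketch (again with made-up data, and z-score normalization as the scaling method, as in the course) comparing how many gradient-descent steps it takes to converge on raw features at the largest stable learning rate versus scaled features at a much larger one:

```python
import numpy as np

# Hypothetical toy dataset: house size (large scale) and bedrooms (small scale).
rng = np.random.default_rng(0)
size = rng.uniform(500, 2000, 50)
beds = rng.integers(1, 5, 50).astype(float)
X_raw = np.column_stack([size, beds])
y = 0.1 * size + 20 * beds + 5

def steps_to_converge(X, y, lr, max_steps=10000, tol=1e-6):
    """Batch gradient descent on MSE; returns the number of steps until the
    gradient norm drops below tol, or max_steps if it never does."""
    Xb = np.column_stack([X, np.ones(len(X))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for i in range(max_steps):
        grad = 2 * Xb.T @ (Xb @ w - y) / len(y)
        if np.linalg.norm(grad) < tol:
            return i
        w -= lr * grad
    return max_steps

# Unscaled: the learning rate must be tiny or the size direction diverges,
# so the bedrooms/bias directions crawl.
steps_raw = steps_to_converge(X_raw, y, lr=3e-7)

# Z-score scaled: both directions tolerate the same, much larger step.
X_scaled = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
steps_scaled = steps_to_converge(X_scaled, y, lr=0.1)

print(steps_raw, steps_scaled)  # scaled converges in far fewer steps
```

This is the “direct path” from the slide in code form: after scaling, one learning rate serves every direction, so convergence takes orders of magnitude fewer steps.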

To echo what I said at the beginning, re-watch the lecture if you have time :wink:

Cheers,
Raymond
