Hi everyone,
I have a question. In the feature engineering and feature scaling labs, it was said that when features are scaled, gradient descent runs much faster and works with higher values of alpha (the learning rate).
This concept is not quite clear to me. How can normalizing the features allow us to use a higher value of alpha than when the features are not normalized?
Thus, the second term in the update equation for w_j contains \alpha.x_j.
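For reference, here is a sketch of the update equation being referred to, assuming it is the usual gradient descent update for linear regression from Course 1 (the equation itself is not shown in this post, so the notation f_{w,b}, m and the superscript (i) are my assumption):

```latex
w_j := w_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}\big(x^{(i)}\big) - y^{(i)} \right) x_j^{(i)}
```

The second term is where \alpha multiplies the feature value x_j, so the size of the update for w_j depends directly on the scale of that feature.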
Now, if feature x_1 is in the 1000s and feature x_2 is a single digit, and we apply the same \alpha to every feature, the features with larger magnitudes will have a much larger value of \alpha.x_j and hence a much larger update in each iteration. As a result, the weights of the large-magnitude features are prone to bouncing around, which can eventually lead to divergence. To prevent this, we have to set a much lower \alpha so that the weight updates for the large-magnitude features are kept under control. The downside is that the small-magnitude features, because they share the same \alpha, get even smaller updates in each iteration and therefore take longer to converge.
But if we normalize all the features, we are free to set a higher \alpha without fear of the weight updates for any feature bouncing around and diverging. Normalization ensures that \alpha.x_j, for all values of j, stays within a similar and reasonable range. We can therefore push \alpha higher with normalized features, which helps gradient descent converge faster.
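Here is a minimal sketch of this effect in plain Python. The toy housing data and the helper names (`zscore`, `cost_of`, `final_cost`) are made up for illustration and are not taken from the lab; the model is assumed to be linear regression with a squared-error cost.

```python
import math

# Hypothetical toy data: x1 ~ size in sq. ft (1000s), x2 ~ bedrooms (single digit).
size = [300, 500, 800, 1000, 1200, 1500, 1800, 2000]
bedrooms = [2, 0, 4, 1, 5, 1, 3, 2]
price = [0.1 * s + 50 * r + 50 for s, r in zip(size, bedrooms)]


def zscore(column):
    """Z-score normalize a list: subtract the mean, divide by the std."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column]


def cost_of(x1, x2, y, w1, w2, b):
    """Squared-error cost J = 1/(2m) * sum((w1*x1 + w2*x2 + b - y)^2)."""
    errors = [w1 * a + w2 * c + b - t for a, c, t in zip(x1, x2, y)]
    return sum(e * e for e in errors) / (2 * len(y))


def final_cost(x1, x2, y, alpha, iterations=1000):
    """Batch gradient descent from zero weights; math.inf means it diverged."""
    m = len(y)
    w1 = w2 = b = 0.0
    for _ in range(iterations):
        errors = [w1 * a + w2 * c + b - t for a, c, t in zip(x1, x2, y)]
        # Each weight's gradient is multiplied by its own feature x_j.
        grad_w1 = sum(e * a for e, a in zip(errors, x1)) / m
        grad_w2 = sum(e * c for e, c in zip(errors, x2)) / m
        grad_b = sum(errors) / m
        # Simultaneous update, same alpha for every parameter.
        w1, w2, b = w1 - alpha * grad_w1, w2 - alpha * grad_w2, b - alpha * grad_b
        if not all(math.isfinite(p) for p in (w1, w2, b)):
            return math.inf  # the updates blew up
    cost = cost_of(x1, x2, y, w1, w2, b)
    return cost if math.isfinite(cost) else math.inf


# Same number of iterations in all three runs.
cost_raw_large = final_cost(size, bedrooms, price, alpha=0.1)    # raw features, large alpha
cost_raw_small = final_cost(size, bedrooms, price, alpha=1e-7)   # raw features, tiny alpha
cost_scaled = final_cost(zscore(size), zscore(bedrooms), price, alpha=0.1)

print("raw features, alpha=0.1   :", cost_raw_large)
print("raw features, alpha=1e-7  :", cost_raw_small)
print("scaled features, alpha=0.1:", cost_scaled)
```

On this toy data, the raw run with alpha = 0.1 diverges, the raw run with alpha = 1e-7 stays stable but is still far from the minimum after 1000 iterations, and the scaled run with alpha = 0.1 gets very close to zero cost in the same number of iterations.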
You might have already watched the lecture “Feature scaling part 1” in Course 1 Week 2, but sometimes watching a lecture again at a different time can give a learner a different angle and lead to a working understanding. Here is the slide from that lecture that is most relevant to your question. Note in particular the red arrows, which depict that gradient descent has a difficult time converging with non-scaled features, whereas scaled features provide a much more “direct path” to the optimal solution.
As Tom has also explained, with unscaled features we need to carefully pick a learning rate small enough to avoid divergence along the direction with the small range, w_1 (the weight for size in feet²). For example, if we look at the upper right graph along the w_1 direction, each “walking” step needs to stay within something like 0 ~ 0.2 for it not to diverge. The step size is controlled by the learning rate, which has to be small enough not to push the step outside the acceptable range.
Such a small learning rate, however, does not suit the w_2 direction (number of bedrooms), which spans a larger range of 0 ~ 100. A reasonable step for w_2 is likely something like 0 ~ 20, which is 100 times larger than the acceptable range for w_1. Therefore, given that both directions must use the same learning rate, a small learning rate lets us walk with a reasonable step size in the w_1 direction but is too small for w_2, and because of that, w_2 takes “more time” (more steps) to converge.
If we then look at the bottom right graph, where both features are scaled to the same range, both directions now accept a similar step size, so one direction does not need to walk slower to “accommodate” the other.
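To put rough numbers on "acceptable step size per direction", here is a small sketch. It uses a simplification that is my own assumption, not something from the lecture: if only one weight w_j is updated while everything else is held fixed, the squared-error cost is a parabola in w_j with curvature mean(x_j^2), so gradient descent on that weight converges only when alpha < 2 / mean(x_j^2). The data and the `max_stable_alpha` helper are hypothetical.

```python
import math

# Hypothetical toy data: size in sq. ft and number of bedrooms.
size = [300, 500, 800, 1000, 1200, 1500, 1800, 2000]
bedrooms = [2, 0, 4, 1, 5, 1, 3, 2]


def zscore(column):
    """Z-score normalize a list: subtract the mean, divide by the std."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column]


def max_stable_alpha(feature):
    """Bound on alpha when only this feature's weight is updated.

    Along w_j alone the cost is a parabola with curvature mean(x_j^2),
    so the update converges only when alpha < 2 / mean(x_j^2).
    """
    curvature = sum(v * v for v in feature) / len(feature)
    return 2.0 / curvature


raw_w1 = max_stable_alpha(size)        # direction of w_1 (size)
raw_w2 = max_stable_alpha(bedrooms)    # direction of w_2 (bedrooms)
scaled_w1 = max_stable_alpha(zscore(size))
scaled_w2 = max_stable_alpha(zscore(bedrooms))

print("raw    : w_1 bound =", raw_w1, " w_2 bound =", raw_w2)
print("scaled : w_1 bound =", scaled_w1, " w_2 bound =", scaled_w2)
```

With the raw features the two bounds differ by several orders of magnitude, so the shared learning rate is dictated by the w_1 direction and is far smaller than what w_2 could tolerate. After scaling, both directions have essentially the same bound, which matches the picture in the bottom right graph.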
To echo what I said at the beginning, re-watch the lecture if you have time.