The relation between scaling and learning rate

Hi everyone,
I have a question. In the feature engineering and feature scaling labs, it was said that when the features are scaled, gradient descent runs much faster and can use higher values of alpha (the learning rate).

This concept is not quite clear to me. How can normalizing the features allow us to use a higher value of alpha compared to when the features weren't normalized?

And why then does GD run faster?

Normalizing the features causes the gradients to all have similar magnitudes.

This lets you use a larger learning rate (alpha) without risking that a single large-magnitude feature will cause the solution to diverge to infinity.

Gradient descent will then run faster because the larger learning rate means you need fewer iterations.
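To make that concrete, here is a minimal sketch in plain NumPy (the housing-style data, numbers, and names are made up purely for illustration, with a squared-error cost) that just prints the per-feature gradient before and after z-score scaling:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
size_sqft = rng.uniform(300, 2000, m)        # a feature in the thousands
bedrooms = rng.integers(1, 6, m)             # a feature in single digits
X = np.column_stack([size_sqft, bedrooms])
y = 0.1 * size_sqft + 20 * bedrooms + rng.normal(0, 10, m)  # made-up target

def gradient_w(X, y, w, b):
    """dJ/dw for the squared-error cost: (1/m) * X^T (Xw + b - y)."""
    err = X @ w + b - y
    return (X.T @ err) / len(y)

w0, b0 = np.zeros(2), 0.0
print("unscaled gradients:", gradient_w(X, y, w0, b0))       # wildly different magnitudes

# z-score scaling: each column gets zero mean and unit standard deviation
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
print("scaled gradients:  ", gradient_w(X_norm, y, w0, b0))  # similar magnitudes
```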

@Mahmoud_Mohamed4

Let's look at the problem mathematically. We know that

\frac{\partial J}{\partial w_{j}} = \frac{1}{m} \sum_{i=1}^m \left( f_{w,b}(\vec{x}^{(i)}) - y^{(i)} \right) \cdot x_j^{(i)}

where j stands for each feature.

Also, we know that the update equation is

w_{j} = w_{j} - \alpha \cdot \frac{\partial J}{\partial w_{j}}

Thus, the second term in the update equation contains the product \alpha \cdot x_j.

Now, if feature x_1 is in the 1000's and feature x_2 is in single digits, and we apply the same \alpha to every feature, the feature with the larger magnitude will have a much larger value of \alpha \cdot x_j and hence a much larger update in each iteration. As a result, the weight update for the large-magnitude feature is prone to bouncing around and eventually diverging. To prevent this from happening, we have to set a much lower value of \alpha, so that the weight updates for the large-magnitude features are kept under control. The downside is that the small-magnitude features, which use the same \alpha, will have even smaller updates in each iteration, thereby taking longer to converge.
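As a rough illustration of that effect, here is a tiny NumPy sketch of the two equations above with made-up numbers; it performs a single update step and shows how much larger the step is for the large-magnitude feature:

```python
import numpy as np

# made-up data: x_1 in the thousands, x_2 in single digits
X = np.array([[1500.0, 3.0],
              [2000.0, 4.0],
              [ 850.0, 2.0]])
y = np.array([250.0, 320.0, 150.0])
m = len(y)

w = np.zeros(2)
b = 0.0
alpha = 1e-3

# dJ/dw_j = (1/m) * sum_i (f_wb(x^(i)) - y^(i)) * x_j^(i)
err = X @ w + b - y
grad_w = (X.T @ err) / m

# w_j := w_j - alpha * dJ/dw_j  -- the step for each weight scales with x_j
print("per-feature gradients:", grad_w)           # about [-380833, -777]
print("per-feature updates:  ", -alpha * grad_w)  # about [381, 0.78]: huge vs. tiny
w = w - alpha * grad_w
```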

But if we normalize all the features, we are at liberty to set a higher value of \alpha without fear of the weight updates for any feature bouncing around and diverging. Normalizing the features ensures that \alpha \cdot x_j, for every value of j, stays within a similar and reasonable range. We can therefore push \alpha higher and higher in the normalized case, helping gradient descent converge faster.
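To sketch that claim end to end (again with made-up data and a simple batch gradient descent loop in plain NumPy, not the lab code): the same learning rate that converges quickly on z-score normalized features blows up on the raw features, while the raw features only tolerate a tiny learning rate and converge slowly:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 200
X = np.column_stack([rng.uniform(300, 2000, m),        # size in sq ft, large magnitudes
                     rng.integers(1, 6, m)])           # bedrooms, single digits
y = 0.1 * X[:, 0] + 20 * X[:, 1] + rng.normal(0, 5, m) # made-up prices

def run_gd(X, y, alpha, iters):
    """Plain batch gradient descent on the squared-error cost; returns the final cost."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        err = X @ w + b - y
        w -= alpha * (X.T @ err) / len(y)
        b -= alpha * err.mean()
    return 0.5 * np.mean((X @ w + b - y) ** 2)

# z-score normalization: zero mean, unit standard deviation per feature
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

print("raw,    alpha=0.1,  10 iters :", run_gd(X, y, 0.1, 10))        # diverges -> astronomically large cost
print("raw,    alpha=1e-6, 1000 its :", run_gd(X, y, 1e-6, 1000))     # stable, but still far from the optimum
print("scaled, alpha=0.1,  1000 its :", run_gd(X_norm, y, 0.1, 1000)) # converges to roughly the noise level
```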

Hi @Mahmoud_Mohamed4

You might have watched the lecture for “Feature scaling part 1” in Course 1 Week 2, but sometimes watching a lecture again at a different time can give a learner a different angle for building a working understanding. Here is a slide from that lecture that is most relevant to your question; in particular, the red arrows depict how non-scaled features make it difficult to converge, whereas scaled features provide a much more “direct path” to the optimal solution.

As Tom has also explained, with unscaled features we need to pick the learning rate very carefully: it has to be small enough to avoid divergence along the w_1 direction, the weight for size in feet², whose acceptable range is small. For example, if we look at the upper-right graph along the w_1 direction, each “walking” step has to stay within roughly 0 ~ 0.2 for it not to diverge. The step size is controlled by the learning rate, and the learning rate has to be small enough not to push the step outside that acceptable range.

Such a small learning rate, however, does not suit the w_2 direction (# bedrooms), which spans a larger range of about 0 ~ 100. A reasonable step for w_2 is likely something like 0 ~ 20, which is 100 times larger than the acceptable range for w_1. Therefore, under the limitation that both directions must use the same learning rate, a learning rate small enough to give a reasonable step size in the w_1 direction is too small for w_2, and because of that it takes “more time” (more steps) for w_2 to converge.

If we then look at the bottom-right graph, which has both features scaled to the same range, both directions now accept a similar step size, so one direction no longer needs to walk slower to “accommodate” the other.
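If it helps to see the same idea in numbers, here is a small sketch (made-up house data, plain NumPy, and ordinary least squares instead of gradient descent, just to read off the weight scales): before scaling, the best-fit weights for the two features live on very different scales; after z-score scaling, they land in a comparable range, which is why one step size can then suit both directions.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 100
size_sqft = rng.uniform(300, 2000, m)
bedrooms = rng.integers(1, 6, m).astype(float)
price = 0.1 * size_sqft + 25 * bedrooms + rng.normal(0, 10, m)   # made-up prices

def best_fit(X, y):
    """Least-squares weights (bias column appended), just to read off the weight scales."""
    A = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1]                      # keep w_1, w_2; drop the bias

X_raw = np.column_stack([size_sqft, bedrooms])
X_norm = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

print("raw weights   :", best_fit(X_raw, price))   # roughly [0.1, 25]: very different scales
print("scaled weights:", best_fit(X_norm, price))  # both land in a comparable range
```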

To echo what I said at the beginning, re-watch the lecture if you have time :wink:

Cheers,
Raymond
