About gradient descent and Features scaling

hadeer.awad02 · August 12, 2022, 7:46pm

Can anyone explain why the gradient descent path (the red lines) differs between the two graphs? In other words, why does gradient descent converge faster in the second figure than in the first?

TMosh · August 12, 2022, 8:13pm

In the first graph, the gradient descent path is not sketched correctly. It will be perpendicular to the contour lines at each iteration, so the path to the minimum will be curved (if the learning rate is “low enough”) or it will oscillate back and forth (or perhaps diverge) if the learning rate is too large.

The bottom plot shows gradients that always point toward the minimum. This gives you greater flexibility in setting the learning rate for rapid convergence.
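As a rough check of the geometry described here, the sketch below uses a toy cost J(w1, w2) = 0.5 * (a * w1**2 + b * w2**2), which is an assumption made for illustration and not the cost from the course figures. It measures how well the descent direction lines up with the straight line to the minimum: the two coincide for circular contours (a == b) but not for elongated ones. The function name `alignment_with_minimum` is hypothetical.

```python
import math

def alignment_with_minimum(a, b, w1, w2):
    """Cosine of the angle between the descent direction and the straight
    line to the minimum, for J(w1, w2) = 0.5 * (a * w1**2 + b * w2**2).
    The minimum of this toy cost is at (0, 0)."""
    step = (-a * w1, -b * w2)   # negative gradient, perpendicular to the contour
    to_min = (-w1, -w2)         # direction from the current point to the minimum
    dot = step[0] * to_min[0] + step[1] * to_min[1]
    return dot / (math.hypot(*step) * math.hypot(*to_min))

# Circular contours: the step points straight at the minimum (cosine of 1).
cos_circular = alignment_with_minimum(1.0, 1.0, 1.0, 1.0)

# Elongated contours: the step is pulled toward the steep axis (cosine about 0.73).
cos_elongated = alignment_with_minimum(1.0, 25.0, 1.0, 1.0)

print(cos_circular, cos_elongated)
```

The further the cosine drops below 1, the more each step has to be corrected by the following ones, which is where the curved or zig-zag path comes from.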

HulloMrChips · August 18, 2022, 8:32am

Thanks @TMosh I wonder if I can follow this up with another question?

I’m trying to understand why the scale of the features matters. The illustration using the two graphs above is a bit misleading, as you point out. Surely if you scale the features, the effect is simply that the GD “steps” are scaled appropriately too? So an adjusted illustration might look like this:

… or am I taking this illustration too literally?

HulloMrChips · August 18, 2022, 8:46am

… is feature scaling necessary because we use the same alpha (learning rate) value for all parameters? If we specified a different alpha for each parameter, perhaps it wouldn’t be necessary to scale the features?

TMosh · August 18, 2022, 3:29pm

Sort of. It’s a sketch that shows one potential issue with not normalizing the data.

The red arrows in the oval plot are actually depicting what happens if the learning rate is too high. It can cause the new weight values to over-compensate. In the worst case, the solution could diverge to infinity rather than converge to the center.
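A minimal sketch of that over-compensation, assuming a toy elongated bowl J(w1, w2) = 0.5 * (w1**2 + 25 * w2**2) rather than the cost behind the plot. The helper `descend` and both learning rates are made up for illustration. Each update multiplies w2 by (1 - 25 * alpha), so once alpha passes 2 / 25 = 0.08 the sign flips on every step and the magnitude grows.

```python
def descend(alpha, steps, start=(1.0, 1.0)):
    """Plain gradient descent on J(w1, w2) = 0.5 * (w1**2 + 25 * w2**2).
    Returns the list of visited points, starting with `start`."""
    w1, w2 = start
    path = [(w1, w2)]
    for _ in range(steps):
        g1, g2 = w1, 25.0 * w2                     # gradient of the toy cost
        w1, w2 = w1 - alpha * g1, w2 - alpha * g2  # same alpha for both weights
        path.append((w1, w2))
    return path

# Just below the stability limit: w2 overshoots and flips sign, but shrinks.
path_ok = descend(alpha=0.07, steps=50)

# Just above the limit: w2 overshoots by more than it started with, so it diverges.
path_bad = descend(alpha=0.09, steps=50)

print("alpha=0.07 ends at", path_ok[-1])
print("alpha=0.09 ends at", path_bad[-1])
```

Note that the steep direction (w2) sets the limit on alpha, while the shallow direction (w1) is the one that then converges slowly.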

TMosh · August 18, 2022, 3:31pm

First, feature scaling isn’t strictly necessary. But if you don’t normalize the features, you may have to use an extremely small learning rate, along with an extremely large number of iterations.
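To make the trade-off concrete, here is a small sketch in plain Python. The data set (house size and bedroom count), the targets, and both learning rates are invented for illustration, and the helper names are hypothetical. It runs the same number of gradient descent iterations on raw and on z-score normalized features and compares the final cost.

```python
def zscore_columns(X):
    """Return a copy of X with each column shifted to mean 0 and scaled to std 1."""
    m, n = len(X), len(X[0])
    means = [sum(row[j] for row in X) / m for j in range(n)]
    stds = [(sum((row[j] - means[j]) ** 2 for row in X) / m) ** 0.5 for j in range(n)]
    return [[(row[j] - means[j]) / stds[j] for j in range(n)] for row in X]

def predict(row, w, b):
    return sum(wj * xj for wj, xj in zip(w, row)) + b

def cost(X, y, w, b):
    """Mean squared error cost, J = 1/(2m) * sum of squared errors."""
    return sum((predict(row, w, b) - yi) ** 2 for row, yi in zip(X, y)) / (2 * len(X))

def gradient_descent(X, y, alpha, iters):
    """Batch gradient descent for linear regression with one shared alpha."""
    m, n = len(X), len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(iters):
        errs = [predict(row, w, b) - yi for row, yi in zip(X, y)]
        grad_w = [sum(e * row[j] for e, row in zip(errs, X)) / m for j in range(n)]
        grad_b = sum(errs) / m
        w = [wj - alpha * gj for wj, gj in zip(w, grad_w)]
        b -= alpha * grad_b
    return w, b

# Made-up data: [size in sqft, bedrooms], with an exactly linear target.
X_raw = [[852.0, 2.0], [1416.0, 3.0], [1534.0, 3.0],
         [2104.0, 5.0], [1200.0, 2.0], [1800.0, 4.0]]
y = [0.1 * size + 20.0 * beds + 50.0 for size, beds in X_raw]

# Raw features: alpha has to be tiny or the size weight diverges.
w_u, b_u = gradient_descent(X_raw, y, alpha=1e-7, iters=1000)
cost_unscaled = cost(X_raw, y, w_u, b_u)

# Normalized features: a much larger alpha is stable.
X_norm = zscore_columns(X_raw)
w_s, b_s = gradient_descent(X_norm, y, alpha=0.5, iters=1000)
cost_scaled = cost(X_norm, y, w_s, b_s)

print("cost after 1000 iterations, raw features:       ", cost_unscaled)
print("cost after 1000 iterations, normalized features:", cost_scaled)
```

On the raw features the size weight settles quickly, but the bedroom weight and the bias barely move at a learning rate this small, so the cost stalls well above zero.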

We don’t have a simple method that allows a separate learning rate for each feature. There’s no way to determine the best learning rates at the same time as learning the best weight values. Any solution that could accomplish this would be far more complicated than just normalizing the data set first.
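One way to see how the two ideas relate, offered as a sketch and not a recipe: for plain gradient descent, dividing feature j by a constant s_j and using one shared alpha gives the same predictions as leaving the feature alone and giving its weight the rate alpha / s_j**2. Hand-picking per-feature rates therefore amounts to hand-picking feature scales, which is the job normalization already does. The data, the `scales` values, and the helper `gd` below are all made up for illustration, and the bias term is left out to keep it short.

```python
def gd(X, y, alphas, iters):
    """Gradient descent for linear regression without a bias term.
    `alphas` holds one learning rate per weight."""
    m, n = len(X), len(X[0])
    w = [0.0] * n
    for _ in range(iters):
        errs = [sum(wj * xj for wj, xj in zip(w, row)) - yi for row, yi in zip(X, y)]
        grads = [sum(e * row[j] for e, row in zip(errs, X)) / m for j in range(n)]
        w = [wj - a * g for wj, a, g in zip(w, alphas, grads)]
    return w

X_raw = [[1000.0, 2.0], [1500.0, 3.0], [2000.0, 5.0]]
y = [200.0, 300.0, 420.0]
scales = [1000.0, 1.0]   # hypothetical per-feature divisors
alpha = 0.05

# Option A: rescale the features, then use one shared learning rate.
X_scaled = [[x / s for x, s in zip(row, scales)] for row in X_raw]
w_scaled = gd(X_scaled, y, [alpha] * 2, iters=200)

# Option B: keep the raw features, use a per-weight rate of alpha / s_j**2.
w_raw = gd(X_raw, y, [alpha / s ** 2 for s in scales], iters=200)

# The two runs describe the same model: w_scaled[j] equals w_raw[j] * scales[j].
for j in range(2):
    print(w_scaled[j], w_raw[j] * scales[j])
```

The catch is that option B needs the right `scales` up front, and computing them from the data is exactly what normalization does.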

HulloMrChips · August 19, 2022, 11:28am

Great - thanks for this insight. That last point of yours resonates.

Related topics

Topic	Category	Replies	Views	Date
Is my understanding of Feature Scaling correct?	Supervised ML: Regression and Classification week-module-2	3	530	August 12, 2022
The relation between scaling and learning rate	Supervised ML: Regression and Classification week-module-2	3	537	March 27, 2023
Week 2 Lab 3 Question About Feature Scaling	Supervised ML: Regression and Classification week-module-2	1	288	January 8, 2024
Step size vs feature scaling	Supervised ML: Regression and Classification week-module-2	1	493	September 1, 2022
Optional Lab: Feature Engineering and Polynomial Regression (Feature scaling impact on Convergence)	Supervised ML: Regression and Classification week-module-2	2	532	September 2, 2022