How does feature scaling make gradient descent faster?

I understand what is said in the course, namely that we take fewer steps when our contour plots look more like circles. But I want to know whether there is a mathematical explanation for this statement, or whether it is only derived from comparing the results of a scaled data set and a non-scaled one.
Thanks.

Hi @Sepehr_Razavi,

If you don’t normalize and the features have very different scales, you can still optimize the model, provided you use a sufficiently small learning rate. That is exactly what causes the extra steps: the smaller the learning rate, the more steps you need.
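To add the mathematical side you asked about (this is the standard textbook analysis for a quadratic cost, not something specific to the course slides): for a cost $J(\mathbf{w}) = \tfrac{1}{2}\mathbf{w}^\top H \mathbf{w}$ (for linear regression, $H = \tfrac{1}{m} X^\top X$), the gradient descent update is

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha H \mathbf{w}_t = (I - \alpha H)\,\mathbf{w}_t,$$

which is stable only when $\alpha < 2/\lambda_{\max}(H)$. With $\alpha$ capped like that, the error along the flattest direction shrinks by a factor of roughly $1 - \alpha\lambda_{\min}$ per step, so the number of steps grows with the condition number $\kappa = \lambda_{\max}/\lambda_{\min}$. Features on very different scales make $\kappa$ large (elongated elliptical contours); normalizing them pushes $\kappa$ toward 1 (circular contours), so a single, much larger $\alpha$ makes fast progress in every direction at once.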

As for why we need a smaller learning rate, I think it is easier to see visually. This discussion used a slide to explain this…
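If you want to see the effect numerically, here is a minimal NumPy sketch (my own toy example, not from the course) that counts gradient descent steps on the same regression problem with and without z-score normalization. The feature ranges, learning rates, and tolerance are all illustrative choices:

```python
import numpy as np

def gd_steps(X, y, alpha, tol=1e-6, max_iters=100_000):
    """Batch gradient descent on J(w) = (1/2m)||Xw - y||^2.
    Returns the number of steps until the gradient norm drops below tol."""
    m, n = X.shape
    w = np.zeros(n)
    for step in range(1, max_iters + 1):
        grad = X.T @ (X @ w - y) / m
        if np.linalg.norm(grad) < tol:
            return step
        w -= alpha * grad
    return max_iters  # did not converge within the budget

rng = np.random.default_rng(0)
m = 200
# Two toy features on different scales: one ranges 0-10, the other 0-1
X = np.column_stack([rng.uniform(0, 10, m), rng.uniform(0, 1, m)])
y = X @ np.array([3.0, 5.0]) + rng.normal(0, 0.1, m)

# Z-score normalization: each column gets mean 0 and standard deviation 1
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

# Unscaled: the learning rate is capped by the steepest direction,
# so progress along the shallow direction is slow (thousands of steps).
print("unscaled:  ", gd_steps(X, y, alpha=0.04))
# Normalized: near-circular contours let one larger learning rate
# work well in every direction (tens of steps).
print("normalized:", gd_steps(X_norm, y, alpha=0.5))
```

The unscaled run needs the tiny learning rate only because the steepest curvature would otherwise make it diverge; try raising its `alpha` and you should see the cost blow up, which is the picture from the slide in numbers.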

Raymond

Thank you very much :pray:

You’re welcome @Sepehr_Razavi