How does scaling make gradient descent faster?

Hi @Sepehr_Razavi,

If you don’t normalize and the features have very different scales, you can still optimize the model, but only with a sufficiently small learning rate, and that’s exactly why it takes more steps: smaller learning rate, more steps.
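
To make this concrete, here is a minimal sketch (my own, not from the course material) that fits a two-feature linear regression with batch gradient descent. The feature ranges, learning rates, and the `steps_to_converge` helper are all made up for illustration; the point is only that the unscaled problem forces a tiny learning rate while the scaled one converges quickly with a larger one.

```python
import numpy as np

# Two features with very different scales (illustrative values).
rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(0, 1, n)        # small-scale feature
x2 = rng.uniform(0, 1000, n)     # large-scale feature
X = np.column_stack([x1, x2])
y = 3 * x1 + 0.05 * x2 + rng.normal(0, 0.1, n)

def steps_to_converge(X, y, lr, tol=1e-6, max_steps=100_000):
    """Batch gradient descent on MSE; returns steps taken, or None if it diverges."""
    w = np.zeros(X.shape[1])
    for step in range(max_steps):
        grad = 2 / len(y) * X.T @ (X @ w - y)
        if not np.all(np.isfinite(grad)):
            return None                      # update blew up: lr too large
        w_new = w - lr * grad
        if np.linalg.norm(w_new - w) < tol:  # updates have become negligible
            return step
        w = w_new
    return max_steps                         # hit the step cap

# Unscaled: a moderate lr diverges; only a tiny lr is stable, so it needs many steps.
print(steps_to_converge(X, y, lr=1e-1))   # None: diverges
print(steps_to_converge(X, y, lr=1e-7))   # stable, but very slow (may hit the cap)

# Scaled: both features have comparable magnitudes, so one moderate lr works well.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
print(steps_to_converge(Xs, y, lr=1e-1))  # converges in far fewer steps
```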

As for why we need a smaller learning rate, I think it’s easiest to see visually. This discussion used a slide to explain it…

Raymond