Hi @Sepehr_Razavi,
If you don’t normalize and the features have very different scales, you can still optimize the model, provided you use a sufficiently small learning rate. That is exactly why it takes more steps: the smaller the learning rate, the more steps you need.
As for why we need a smaller learning rate, I think it’s easiest to visualize. This discussion used a slide to explain it…
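If it helps to see the effect in numbers, here is a minimal NumPy sketch (the data, scales, and variable names are all illustrative, not from the course): with unscaled features, a learning rate that works fine after standardization makes gradient descent blow up, while the tiny learning rate that is stable on the raw features barely makes progress in the same number of steps.

```python
import numpy as np

def run_gd(X, y, lr, steps):
    """Plain batch gradient descent on mean-squared error; returns final MSE."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(0, 1, n)        # feature on a scale of ~1
x2 = rng.uniform(0, 1000, n)     # feature on a scale of ~1000
y = 3.0 * x1 + 0.002 * x2        # both features matter about equally

bias = np.ones(n)
X_raw = np.column_stack([bias, x1, x2])

# Standardize the two features (leave the bias column alone).
z = (np.column_stack([x1, x2]) - [x1.mean(), x2.mean()]) / [x1.std(), x2.std()]
X_norm = np.column_stack([bias, z])

# Same learning rate: diverges on raw features, converges when normalized.
loss_big_lr_raw = run_gd(X_raw, y, lr=0.1, steps=10)      # error explodes
loss_big_lr_norm = run_gd(X_norm, y, lr=0.1, steps=1000)  # near zero

# Raw features force a tiny learning rate, which then needs far more steps.
loss_small_lr_raw = run_gd(X_raw, y, lr=1e-6, steps=1000)  # still far off
```

Intuitively, the largest feature scale caps the usable learning rate, while the smallest scale sets how far the slow directions must travel; normalization brings those scales together so one moderate learning rate works for every direction.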
Raymond