Why does the algorithm converge faster when I change the scale of the features?

Hello, could someone explain to me mathematically why the gradient descent algorithm converges faster when I change the scale of the features?

When the features are scaled, the fluctuations along gradient descent's downhill path toward the optimum become smaller, and so does the chance of overshooting and bypassing the optimum.
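To put a bit of math behind that: for a least-squares loss, the curvature along each weight grows with the squared magnitude of the corresponding feature, so features on very different scales give a long, narrow bowl. On a quadratic bowl, gradient descent is only stable if the learning rate stays below 2 divided by the largest curvature, yet progress along the flattest direction shrinks by a factor of (1 - lr * smallest curvature) per step. The ratio of largest to smallest curvature therefore controls how slow things get. Here is a rough sketch of that effect, not a real regression: the function `gd_steps`, the curvature values 1 and 100, and the learning rates are all made up for illustration.

```python
def gd_steps(curvatures, lr, tol=1e-6, max_steps=100_000):
    """Run gradient descent on f(w) = 0.5 * sum(c_i * w_i**2), starting
    from w_i = 1, and return the number of steps until every |w_i| < tol."""
    w = [1.0 for _ in curvatures]
    for step in range(1, max_steps + 1):
        # The gradient of f with respect to w_i is c_i * w_i.
        w = [wi - lr * ci * wi for wi, ci in zip(w, curvatures)]
        if max(abs(wi) for wi in w) < tol:
            return step
    return max_steps


# Unscaled case (assumed curvatures 1 and 100): the steep direction forces
# lr below 2 / 100, so the flat direction crawls toward the optimum.
unscaled = gd_steps([1.0, 100.0], lr=0.019)

# Scaled case (both curvatures 1): a much larger lr is stable in every direction.
scaled = gd_steps([1.0, 1.0], lr=0.9)

print("steps without scaling:", unscaled)
print("steps with scaling:   ", scaled)
```

In the unscaled run the weight along the steep direction flips sign on every step (the zigzag described above), while the weight along the flat direction barely moves, so many more steps are needed.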

Prof. Andrew Ng explains this quite nicely in the Deep Learning Specialization, so check it out.