Can anyone explain why the gradient descent (the red lines) differs in the two graphs ? in another word why the gradient will converge in the second figure faster than the first one ?

In the first graph, the gradient descent path is not sketched correctly. It will be perpendicular to the contour lines at each iteration, so the path to the minimum will be curved (if the learning rate is â€ślow enoughâ€ť) or it will oscillate back and forth (or perhaps diverge) if the learning rate is too large.

The bottom curve shows that the gradients always point toward the minimum. This gives you greater flexibility in setting the learning rate for rapid convergence.

Thanks @TMosh I wonder if I can follow this up with another question?

Iâ€™m trying to understand why the scale of the features matter. The illustration using the two graphs above is a bit misleading as you point out. Surely if you scale the features, the effect is simply that the GD â€śstepsâ€ť are scaled appropriately too? So an adjusted illustration might look like this:

â€¦ or am I taking this illustration too literally?

â€¦ is feature scaling necessary because we use the same alpha (learning rate) value for all parameters? If we specified a different alpha for each parameter, perhaps it wouldnâ€™t be necessary to scale the features?

Sort of. Itâ€™s a sketch that shows one potential issue with not normalizing the data.

The red arrows in the oval plot are actually depicting what happens if the learning rate is too high. It can cause the new weight values to over-compensate. In the worst case, the solution could diverge to infinity rather than converge to the center.

First, feature scaling isnâ€™t strictly necessary. But if you donâ€™t normalize the features, you may have to use an extremely small learning rate, along with an extremely huge number of iterations.

We donâ€™t have a simple method that allows for separate learning rates for each feature. Thereâ€™s no simultaneous way to determine the best learning rates along with learning the best weight values. Any solution that could accomplish this would be way more complicated than just normalizing the data set first.

Great - thanks for this insight. That last point of yours resonates.