Why normalization helps

When the contour has an elongated shape, we have to use a smaller learning rate since the updates can be bumpy and so will take a longer time to converge.
When we have circular contours, no matter where the gradient descent starts from, convergence will be faster and we can use a higher learning rate when compared to the previous scenario.

Gradient descent weight updates are perpendicular to the contours in both cases. See this visualization as well.

For a practical example, see this