Why normalization helps


In the lecture, without normalization, the contour plot of the cost function had an elongated shape. During gradient descent, we move in a direction perpendicular to the contour lines, but because of the elongated shape, that direction is unlikely to point straight at the minimum of the cost function. Therefore, convergence can take much longer.

However, after normalization, the contours become roughly circular. Gradient descent still moves in a direction perpendicular to the contour lines, but this time that direction leads much more directly to the minimum.

Is this understanding correct?


When the contours are elongated, we have to use a smaller learning rate because the updates can be bumpy (they oscillate across the narrow valley), so convergence takes longer.
When the contours are circular, convergence is faster no matter where gradient descent starts, and we can use a higher learning rate than in the previous scenario.

Gradient descent weight updates are perpendicular to the contours in both cases. See this visualization as well.
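To make the learning-rate point concrete, here is a minimal sketch in plain NumPy (the data and learning rates are made up for illustration): two features on very different scales give elongated contours and force a tiny learning rate, while standardizing the features lets a much larger rate converge quickly.

```python
import numpy as np

def gradient_descent(X, y, lr, max_steps=10000, tol=1e-8):
    """Batch gradient descent on mean-squared error.

    Returns the number of steps taken before the gradient norm
    drops below tol (or max_steps if it never does).
    """
    w = np.zeros(X.shape[1])
    for step in range(max_steps):
        grad = X.T @ (X @ w - y) / len(y)  # perpendicular to the cost contours
        if np.linalg.norm(grad) < tol:
            return step
        w -= lr * grad
    return max_steps

rng = np.random.default_rng(0)
# One feature spans roughly [0, 1], the other [0, 1000]: elongated contours.
X_raw = np.column_stack([rng.random(100), 1000 * rng.random(100)])
y = X_raw @ np.array([2.0, 0.003])

# Standardize each feature (zero mean, unit variance): near-circular contours.
X_norm = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# The raw features force a tiny learning rate to avoid divergence;
# the normalized ones tolerate a much larger one and converge far sooner.
steps_raw = gradient_descent(X_raw, y, lr=1e-6)
steps_norm = gradient_descent(X_norm, y, lr=0.5)
print(steps_raw, steps_norm)
```

With the raw features, the largest curvature direction caps the usable learning rate, so progress along the shallow direction is painfully slow; after standardization the curvatures are comparable in every direction and one step size works well everywhere.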

For a practical example, see this.

Thank you! I understand now!