How does gradient descent escape from a saddle point? (And what is a random perturbation?)

In the last lecture of DLS Course 2 week 2, about the problems of local optima, Andrew argues that local minima are not a problem because they are much rarer than saddle points, and then depicts how gradient descent enters and then exits a saddle point.

The saddle point is exited “because of a random perturbation”. What does that mean?

I could not find anything about this anywhere – besides Why saddle points isn't a problem for gradient descent? and Cost function stuck at local minima - #3 by kenb, which are also helpful but do not answer my question.

Is it the case that the saddle point is never reached exactly, so that being an epsilon away from it is sufficient to “exit” it again after sufficiently many gradient descent iterations? Or, even if the saddle point is reached exactly, is it exited simply due to numerical imprecision? Or is imprecision introduced on purpose, by adding small noise, specifically to escape saddle points?

Your guesses are right.

It is really unlikely, as you said, that an iteration lands you precisely on the point where the derivatives are all exactly zero. So you will more likely slide off one side or the other of the saddle point (in a two-dimensional picture).
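To make that concrete, here is a minimal NumPy sketch (my own illustration, not from the course) on the textbook saddle f(x, y) = x² − y², which has a saddle point at the origin. Starting a tiny epsilon away from the saddle is already enough for plain gradient descent to slide off along the descending direction:

```python
import numpy as np

# Toy saddle: f(x, y) = x**2 - y**2 has a saddle point at the origin.
# Its gradient is (2x, -2y). A tiny offset from the saddle is enough
# for gradient descent to slide off along the descending (y) direction.

def grad(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

p = np.array([0.5, 1e-6])   # start with a tiny offset in y instead of exactly 0
lr = 0.1                     # illustrative learning rate
for _ in range(100):
    p = p - lr * grad(p)

print(p)  # x has shrunk toward 0, while |y| has grown: the iterate escaped the saddle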

The higher the number of variables, the lower the chances that, at a critical point, all the principal directions (which you obtain by diagonalizing the Hessian, assuming it is non-degenerate) correspond to a minimum, or all to a maximum. In high dimensions there is almost always at least one direction along which the function still decreases, so the critical point is a saddle and gradient descent can keep moving downhill along that direction.
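A quick numerical illustration of that intuition (again my own sketch, with a random symmetric matrix standing in for the Hessian at a critical point): the fraction of cases where all eigenvalues share one sign, i.e. a genuine minimum or maximum, falls off quickly as the dimension grows.

```python
import numpy as np

# For a random symmetric matrix (a stand-in for the Hessian at a critical
# point), estimate how often all eigenvalues have the same sign, i.e. how
# often the critical point would be a true minimum or maximum rather than
# a saddle. The fraction drops fast as the dimension n increases.

rng = np.random.default_rng(0)

def same_sign_fraction(n, trials=2000):
    count = 0
    for _ in range(trials):
        a = rng.standard_normal((n, n))
        h = (a + a.T) / 2.0              # make it symmetric, like a Hessian
        eig = np.linalg.eigvalsh(h)
        if np.all(eig > 0) or np.all(eig < 0):
            count += 1
    return count / trials

for n in (1, 2, 5, 10):
    print(n, same_sign_fraction(n))
```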

Hope that helps!

Thanks a lot Nicolas.

Is that what Andrew meant by “random perturbation”, or is there another aspect to it (e.g. imprecision on purpose or by accident)?

I think it’s that. A random perturbation during optimization would be that there is always some small numerical value added to the theoretically expected one, and that tiny deviation is enough to push the iterate off the exact saddle point.
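If it helps, here is a hedged sketch of what deliberately adding such a perturbation could look like in code: tiny Gaussian noise is added to each update, so that even an iterate sitting exactly at the saddle (where the gradient is zero) gets nudged off it. The step size and noise scale are just illustrative choices, not anything from the lecture.

```python
import numpy as np

# Perturbed gradient descent on the same toy saddle f(x, y) = x**2 - y**2.
# Starting exactly at the saddle point, the gradient is zero, so plain
# gradient descent would stay put forever; the added noise breaks the
# symmetry and the iterate escapes along the descending (y) direction.

rng = np.random.default_rng(1)

def grad(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

p = np.zeros(2)                  # start exactly at the saddle point
lr = 0.1                         # illustrative learning rate
noise_scale = 1e-8               # illustrative perturbation size
for _ in range(200):
    p = p - lr * grad(p) + noise_scale * rng.standard_normal(2)

print(p)  # x stays near 0, but |y| has grown far from the saddle
```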