Saddle Point clarification

It sounds like you don’t understand what is meant by a “saddle point”. Here’s the Wikipedia article on the subject. The point (pun partially intended) is that a saddle point is not a local extremum at all: it is a point on the cost surface at which the gradient is zero, but it is neither a local minimum nor a local maximum. The surface curves upward in some directions and downward in others. So landing on such a point does us no good, which is why it is important for the optimizer to move away from that region.
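To make that concrete, here is a small sketch using the textbook surface f(x, y) = x^2 - y^2. That function is my own choice for illustration and is not from the course or the thread; the helper names are hypothetical too. It checks that the gradient is zero at the origin, uses the signs of the Hessian eigenvalues to show the origin is neither a minimum nor a maximum, and then runs plain gradient descent from a slightly perturbed starting point to show how it moves off the saddle.

```python
import numpy as np

# Illustrative surface (an assumption for this example): f(x, y) = x^2 - y^2
def f(p):
    x, y = p
    return x**2 - y**2

def grad(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

# The Hessian of f is constant for this particular surface
HESSIAN = np.array([[2.0, 0.0],
                    [0.0, -2.0]])

def classify_critical_point(hessian):
    """Classify a zero-gradient point from the signs of the Hessian eigenvalues."""
    eig = np.linalg.eigvalsh(hessian)
    if np.all(eig > 0):
        return "local minimum"
    if np.all(eig < 0):
        return "local maximum"
    if np.any(eig > 0) and np.any(eig < 0):
        return "saddle point"
    return "inconclusive"

def gradient_descent(start, lr=0.1, steps=50):
    """Plain gradient descent with a fixed learning rate."""
    p = np.array(start, dtype=float)
    for _ in range(steps):
        p = p - lr * grad(p)
    return p

origin = np.array([0.0, 0.0])
print(grad(origin))                      # gradient is zero at the origin
print(classify_critical_point(HESSIAN))  # mixed-sign eigenvalues

# Starting exactly on the saddle, the gradient is zero and nothing moves.
stuck = gradient_descent([0.0, 0.0])

# A tiny nudge in y is enough: the y component grows every step.
escaped = gradient_descent([0.5, 1e-3])
print(stuck, escaped, f(escaped))
```

Note that this toy surface is unbounded below, so the escaping run just keeps descending. On a real cost surface the same mechanism carries the optimizer away from the saddle and toward a lower-cost region.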

In the very high-dimensional spaces that we are dealing with, there are very large numbers of local extrema and saddle points. But it turns out that most of the local minima that we find with Gradient Descent are likely to be decent solutions. The math behind this is not simple, but here is another thread that discusses the general question of non-convexity, which also refers to the paper from Yann LeCun’s group that shows this.