We had a couple of doubts, sir. Could you please help clarify them?
In a very high-dimensional parameter space, do we have multiple saddle points, or do we always end up with only one saddle point?
Why does the algorithm need to get off the plateau region? At the saddle point the gradient is zero, so doesn't that mean we have reached the global minimum? If we have reached the global minimum, why does the algorithm need to get off the plateau region?
It sounds like you don’t understand what is meant by a “saddle point”. Here’s the Wikipedia article on the subject. The point (pun partially intended) is that a saddle point is not a local extremum at all: it is a point on the cost surface at which the gradient is zero, but it is neither a local minimum nor a local maximum. A zero gradient is necessary for a minimum, but it is not sufficient: at a saddle point the cost still decreases in some directions and increases in others. So finding this point actually does us no good, which is why it is important to move off this region.
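To make this concrete, here is a minimal sketch using a toy cost f(x, y) = x² − y². This function, the starting point, and the learning rate are my own choices for illustration and are not from the course material. The gradient at the origin is zero, but the Hessian has one positive and one negative eigenvalue, so the origin is a saddle point rather than a minimum. Starting gradient descent a tiny distance away from it shows the cost dropping below the value at the saddle.

```python
import numpy as np

def f(p):
    """Toy cost surface with a saddle point at the origin (illustrative only)."""
    x, y = p
    return x**2 - y**2

def grad(p):
    """Analytic gradient of f."""
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

# The Hessian of f is constant: curvature is +2 along x and -2 along y.
hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])

origin = np.array([0.0, 0.0])
grad_at_origin = grad(origin)
eigenvalues = np.linalg.eigvalsh(hessian)

# Zero gradient plus mixed-sign curvature means a saddle, not a minimum.
is_saddle = bool(
    np.allclose(grad_at_origin, 0.0)
    and eigenvalues.min() < 0 < eigenvalues.max()
)

# Plain gradient descent, started slightly off the saddle along y.
p = np.array([0.5, 1e-3])
learning_rate = 0.1
for _ in range(50):
    p = p - learning_rate * grad(p)

final_cost = f(p)

print("gradient at origin:", grad_at_origin)
print("Hessian eigenvalues:", eigenvalues)
print("is saddle:", is_saddle)
print("cost at origin:", f(origin), "cost after descent:", final_cost)
```

Note that this toy function is unbounded below, so the cost keeps falling for as long as you iterate. A real network cost is bounded below, but the same idea applies: there is a direction of descent available at a saddle point, so stopping there leaves a better solution on the table.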
In the very high-dimensional spaces that we are dealing with, there are very large numbers of local extrema and saddle points, so the answer to your first question is that there are many saddle points, not just one. But it turns out that most local minima that we find with Gradient Descent are likely to be decent solutions. The math behind this is not simple, but here is another thread that discusses the general question of non-convexity, which also refers to the paper from Yann LeCun’s group that proves this.