Hello!
I have seen in the videos that, in high dimensional spaces, you are much more likely to run into saddle points than into local optima, and the conclusion was that we shouldn't worry too much about the local optima problem. But my question is:
If you run into a saddle point, the derivatives are still going to be zero, so the parameters W and b don't get updated and the algorithm would get stuck at that saddle point, which is the same problem as with local optima, isn't it?
From the update equations W := W - alpha * dW and b := b - alpha * db we can see that when dW and db are zero, the parameters don't get updated and their values stay the same. So wouldn't the optimization algorithm get stuck? Why, then, do the lectures say that saddle points aren't really a problem, apart from the plateau problem?
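To make what I mean concrete, here is a tiny numpy sketch (my own toy illustration, not code from the course) of the update rule: if the gradients come back as exactly zero, the parameters just stay where they are on every iteration.

```python
import numpy as np

# Toy illustration of W := W - alpha * dW when the gradient is exactly zero.
W = np.array([0.5, -1.2])
alpha = 0.01

for step in range(100):
    dW = np.zeros_like(W)   # pretend we are sitting exactly on a saddle point
    W = W - alpha * dW      # the update does nothing, W never moves

print(W)  # still [ 0.5 -1.2 ]
```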
It’s a good point! Exactly as you say, the gradients are zero at a saddle point, so Gradient Descent would be “stuck” there, but only if you get very unlucky and land exactly on the point at which the gradients are zero. Notice that at a saddle point or a local maximum there are still directions in which the cost surface curves downward, so the slightest perturbation lets the algorithm keep descending; that is not true at a local minimum.

It also turns out that these cost surfaces live in dimensions that are essentially impossible for us to visualize. The inputs to the scalar-valued cost function are all the parameters, and there are typically hundreds of them at a minimum and in many cases millions. Our brains are just not wired to “see” in more than 3 dimensions. The math here is pretty complicated, but it can be shown that for sufficiently high dimensional cost surfaces there is a range of reasonable solutions. Here is a paper from Yann LeCun’s group that explores this. I cannot claim to have read and understood the whole paper, but even reading the Abstract will give you some idea of why Prof Ng says that in practice it is typically not a problem to get stuck in a bad solution because of saddle points or poor local minima. Here is another thread on this same topic that also links to that same paper.
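If you want to see this numerically, here is a small sketch (my own toy example, not from the course) using the classic saddle f(w, b) = w^2 - b^2, whose gradient is (2w, -2b). Starting exactly at the saddle point (0, 0), the updates never move; starting a tiny distance away in b, gradient descent slides away from the saddle almost immediately, because the surface curves downward along the b direction.

```python
import numpy as np

def grad(w, b):
    # Gradient of the toy cost f(w, b) = w**2 - b**2
    return 2 * w, -2 * b

def descend(w, b, alpha=0.1, steps=100):
    for _ in range(steps):
        dw, db = grad(w, b)
        w, b = w - alpha * dw, b - alpha * db
    return w, b

# Starting exactly on the saddle point: the gradient is zero, so we never move.
print(descend(0.0, 0.0))    # (0.0, 0.0) -- stuck

# Starting a hair away from it: the downward-curving direction (b) takes over
# and gradient descent escapes the saddle and keeps descending.
print(descend(0.0, 1e-6))   # b has grown far away from 0
```

This is only a 2-dimensional cartoon, of course; the point of the paper linked above is that in the very high dimensional surfaces we actually optimize, landing exactly on such a point essentially never happens.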
I found that to be a very good answer, thanks a lot for your help!