Hello!
I have seen in the videos that, in high dimensional spaces, you are much more likely to run into saddle points than into local optima, and the conclusion was that we shouldn't worry too much about the local optima problem. But my question is:
If you run into a saddle point, the derivatives are still going to be zero, so the parameters W and b don't get updated and the algorithm would get stuck at that saddle point, which is the same problem as with local optima, isn't it?
From the update equations W := W - alpha * dW and b := b - alpha * db we can see that when dW and db are zero, the parameters don't get updated and their values stay the same. So wouldn't the optimization algorithm get stuck? Why, then, do the lectures say that saddle points aren't really a problem, apart from the plateau problem?
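To make what I mean concrete, here is a tiny numpy sketch (my own toy illustration, not code from the course) of the update rule: if the gradients come back as exactly zero, the parameters just stay where they are on every iteration.

```python
import numpy as np

# Toy illustration of W := W - alpha * dW when the gradient is exactly zero.
W = np.array([0.5, -1.2])
alpha = 0.01

for step in range(100):
    dW = np.zeros_like(W)   # pretend we are sitting exactly on a saddle point
    W = W - alpha * dW      # the update does nothing, W never moves

print(W)  # still [ 0.5 -1.2 ]
```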
It’s a good point! Exactly as you say, the gradients are zero at a saddle point, so Gradient Descent would be “stuck” there, but only if you get very unlucky and land exactly on the point at which the gradients are zero. Notice that at a saddle point or a local maximum there are still directions in which the cost surface curves downward, so the slightest perturbation lets the algorithm keep descending; that is not true at a local minimum.

It also turns out that these cost surfaces live in dimensions that are essentially impossible for us to visualize. The inputs to the scalar-valued cost function are all the parameters, and there are typically hundreds of them at a minimum and in many cases millions. Our brains are just not wired to “see” in more than 3 dimensions. The math here is pretty complicated, but it can be shown that for sufficiently high dimensional cost surfaces there is a range of reasonable solutions. Here is a paper from Yann LeCun’s group that explores this. I cannot claim to have read and understood the whole paper, but even reading the Abstract will give you some idea of why Prof Ng says that in practice it is typically not a problem to get stuck in a bad solution because of saddle points or poor local minima. Here is another thread on this same topic that also links to that same paper.
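If you want to see this numerically, here is a small sketch (my own toy example, not from the course) using the classic saddle f(w, b) = w^2 - b^2, whose gradient is (2w, -2b). Starting exactly at the saddle point (0, 0), the updates never move; starting a tiny distance away in b, gradient descent slides away from the saddle almost immediately, because the surface curves downward along the b direction.

```python
import numpy as np

def grad(w, b):
    # Gradient of the toy cost f(w, b) = w**2 - b**2
    return 2 * w, -2 * b

def descend(w, b, alpha=0.1, steps=100):
    for _ in range(steps):
        dw, db = grad(w, b)
        w, b = w - alpha * dw, b - alpha * db
    return w, b

# Starting exactly on the saddle point: the gradient is zero, so we never move.
print(descend(0.0, 0.0))    # (0.0, 0.0) -- stuck

# Starting a hair away from it: the downward-curving direction (b) takes over
# and gradient descent escapes the saddle and keeps descending.
print(descend(0.0, 1e-6))   # b has grown far away from 0
```

This is only a 2-dimensional cartoon, of course; the point of the paper linked above is that in the very high dimensional surfaces we actually optimize, landing exactly on such a point essentially never happens.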
I found that to be a very good answer, thanks a lot for your help!