I am having trouble visualizing the difference between saddle points and local optima. I am watching the video “The Problem of Local Optima”. The examples of a saddle point and a local optimum shown in the video seem the same to me. Does anyone have a visual or real-world example so I can picture this better?
The picture is pretty easy to see, as shown in the lecture. The real question is what Gradient Descent does when it hits or gets close to a saddle point. My interpretation is that from a saddle point or a local maximum, there are still directions in which the cost decreases, right? Whereas that is not true for a local minimum. So gradient descent should be able to move on and not get stuck, unless you are incredibly unlucky and land exactly on the point at which the gradient is zero in every direction. But one hopes the probability of that is pretty low …
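To make that concrete, here's a toy example of my own (not from the lecture): f(x, y) = x² − y² has a saddle point at the origin, since it's a minimum along x and a maximum along y. Plain gradient descent slides off the saddle unless it starts exactly on the y = 0 axis:

```python
def grad(x, y):
    # gradient of f(x, y) = x^2 - y^2
    return 2 * x, -2 * y

def descend(x, y, lr=0.1, steps=200):
    """Plain gradient descent from (x, y)."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y

# A tiny perturbation off the axis: x -> 0, but |y| grows every step,
# so GD escapes the saddle and keeps reducing f.
print(descend(1.0, 1e-6))

# Starting exactly on y = 0 (measure-zero bad luck): GD converges
# straight to the saddle point and "gets stuck" there.
print(descend(1.0, 0.0))
```

The only way to get trapped is to land exactly on the zero-gradient set, which is why the "incredibly unlucky" caveat above matters so little in practice.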
In terms of actually visualizing any of this the way it really happens, it’s just hopeless. Meaning that we’re typically dealing with literally hundreds of dimensions at a minimum, and it’s not at all unusual to have thousands or even millions of parameters, right? And if you really want to get crazy, it is claimed that GPT-4 has 1.7 trillion parameters. What do things look like in 1.7-trillion-dimensional space?
So we’re left with just visualizing things in 3D, which is all our human brains can handle. That corresponds to 2 parameters, which is pretty pathetic, but we hope that the intuition I stated in the first paragraph still applies.
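If you want to generate those 2-parameter pictures yourself, here's a quick sketch using matplotlib (my own example, not a course asset): it plots a bowl (local minimum) next to a saddle so you can see the difference side by side.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for an interactive window
import matplotlib.pyplot as plt

x = np.linspace(-2, 2, 50)
X, Y = np.meshgrid(x, x)

fig = plt.figure(figsize=(10, 4))
surfaces = [
    (X**2 + Y**2, "local minimum: $x^2 + y^2$"),   # bowl: up in every direction
    (X**2 - Y**2, "saddle point: $x^2 - y^2$"),    # up along x, down along y
]
for i, (Z, title) in enumerate(surfaces):
    ax = fig.add_subplot(1, 2, i + 1, projection="3d")
    ax.plot_surface(X, Y, Z, cmap="viridis")
    ax.set_title(title)
fig.savefig("saddle_vs_minimum.png")
```

Rotating the saddle surface interactively makes the key point obvious: at (0, 0) the gradient is zero in both cases, but the saddle still has a downhill escape route along y.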
FWIW here’s a paper from Yann LeCun’s group, which has some math showing that for sufficiently complex models, there exist lots of local minima that are reasonably good solutions. I don’t claim to understand the math, but please have a look. They do show some nice 3D pictures. Here’s a thread with some more discussion in addition to the link to that paper and some other links that may be worth a look.
Thanks for clearing that up for me, it makes sense now!