How does the Gradient Descent approach reconcile a local minimum with the global minimum? If we really want the global one, how do we avoid getting stuck locally?
As far as I know, finding a general algorithm that always reaches the global optimum is not possible.
I think our first approach should be to construct a loss function that is convex, i.e. one with no (or very few) local optima.
Most models, like neural networks, don’t have a convex loss function, so there is a high chance we may get stuck in a local optimum.
But even if we get stuck in a local optimum, there are a few ways we can try to get out of it (see the sketch after this list):
i) try different initial weights and hope one of them leads to the global optimum
ii) increase the number of iterations
iii) use stochastic gradient descent, since the noise in its updates may help us escape a local optimum.
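For point (i), here is a minimal sketch in plain NumPy (the toy 1-D function, learning rate, and number of restarts are all made up for illustration; a real loss surface is high-dimensional):

```python
# Plain gradient descent on a toy 1-D non-convex function, restarted from
# several random initial weights. f(w) = w**4 - 3*w**2 + w has two local
# minima, so only some starting points reach the lower (global) one.
import numpy as np

def f(w):
    return w**4 - 3 * w**2 + w

def grad_f(w):
    return 4 * w**3 - 6 * w + 1

def gradient_descent(w0, lr=0.01, steps=1000):
    w = w0
    for _ in range(steps):
        w -= lr * grad_f(w)
    return w

rng = np.random.default_rng(0)
candidates = []
for _ in range(10):                      # 10 random restarts
    w0 = rng.uniform(-2.0, 2.0)          # different initial weight each time
    w_final = gradient_descent(w0)
    candidates.append((f(w_final), w_final))

best_loss, best_w = min(candidates)      # keep the restart with the lowest loss
print("best w = %.3f, loss = %.3f" % (best_w, best_loss))
```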
correct me if I’m wrong
Yes, this reminds me of the previous edition of the course. The first part of your answer implies trial and error, and luck!
But how do we know which one is the global minimum? There can be ‘n’ local minima; identifying the global minimum among them could be a never-ending task, right?
No, we won’t know until we explicitly compare all of their cost values, which, as you said, could be a never-ending task. So we seldom, if ever, really target the global minimum; instead, I think we want a stable and low enough local minimum. A low enough local minimum is one that gives us the best metric performance. The way to get there is by tuning hyperparameters, including initializing our neural network weights differently, and then seeing which hyperparameter configuration gives us the best metric performance on the CV dataset.
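For example, a rough sketch of that selection loop in plain NumPy (the random dataset, learning rates, seeds, and the tiny logistic-regression model below are all made-up placeholders, not code from the course):

```python
# Train the same tiny model under several hyperparameter configurations, each
# with a different random weight initialization, and keep whichever
# configuration scores best on the cross-validation split.
import numpy as np

rng = np.random.default_rng(42)
X_train, y_train = rng.normal(size=(200, 5)), rng.integers(0, 2, 200)
X_cv, y_cv = rng.normal(size=(100, 5)), rng.integers(0, 2, 100)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, lr, seed, epochs=500):
    w = np.random.default_rng(seed).normal(scale=0.1, size=X.shape[1])  # init depends on seed
    for _ in range(epochs):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)   # gradient of the logistic loss
        w -= lr * grad
    return w

def accuracy(w, X, y):
    return np.mean((sigmoid(X @ w) > 0.5) == y)

best = None
for lr in (0.01, 0.1, 1.0):          # hyperparameter grid (placeholder values)
    for seed in (0, 1, 2):           # different weight initializations
        w = train(X_train, y_train, lr, seed)
        score = accuracy(w, X_cv, y_cv)              # judge each config on the CV set
        if best is None or score > best[0]:
            best = (score, lr, seed)

print("best CV accuracy %.3f with lr=%s, init seed=%s" % best)
```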
Cheers,
Raymond
Yes, as Raymond says, finding the actual global minimum is probably not possible, but the higher-level point he also makes is that it is not what we really want in any case, since it would most likely represent extreme overfitting on the training set. Remember that what we really want is balanced performance on the cross validation and test data, which is not the same data as the training data. Of course we hope that it has very similar statistical properties, but it is different. Here’s a thread from DLS from a while ago that discusses these issues in more detail.
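To make the overfitting point concrete, here is a tiny made-up illustration (noisy sine data and polynomial fits, not code from the course): the more flexible model drives its training error toward zero, yet typically does worse on the held-out data.

```python
# Fit a modest and a very flexible polynomial to 15 noisy points and compare
# training error with error on a held-out (cross-validation) set.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 1, 15))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=15)
x_cv = np.sort(rng.uniform(0, 1, 100))
y_cv = np.sin(2 * np.pi * x_cv) + rng.normal(scale=0.2, size=100)

for degree in (3, 12):                                 # modest vs. very flexible model
    coeffs = np.polyfit(x_train, y_train, degree)      # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    cv_mse = np.mean((np.polyval(coeffs, x_cv) - y_cv) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f}, cv MSE {cv_mse:.4f}")
```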