Cost function stuck at local minima

It is an important question, but the answer has lots of layers to it.

For the simple case of Logistic Regression, the cost function is actually convex, so any minimum that Gradient Descent reaches is a global minimum and there are no other local minima to get stuck in. Once we graduate to real Neural Networks, though, that is no longer true. The cost surfaces are not convex and there can be lots of local optima.
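If you want to convince yourself of the convexity claim numerically, here is a rough sketch. It is a spot check, not a proof: it samples random pairs of parameter values and tests the midpoint inequality that every convex function must satisfy. The tiny dataset, the 1-D model and the names `DATA` and `cost` are all made up for illustration.

```python
import math
import random

# Tiny made-up dataset of (feature, label) pairs, purely illustrative.
DATA = [(-2.0, 0), (-1.0, 0), (-0.5, 1), (0.5, 0), (1.0, 1), (2.0, 1)]

def cost(w, b):
    """Mean cross-entropy cost of 1-D logistic regression on DATA."""
    total = 0.0
    for x, y in DATA:
        z = w * x + b
        # log(1 + exp(z)) - y*z equals -[y*log(a) + (1-y)*log(1-a)]
        # where a = sigmoid(z); this form avoids computing log(0).
        total += math.log1p(math.exp(z)) - y * z
    return total / len(DATA)

# Convexity implies J(midpoint) <= average of J at the two endpoints.
rng = random.Random(0)
violations = 0
for _ in range(1000):
    w1, b1, w2, b2 = (rng.uniform(-5, 5) for _ in range(4))
    mid = cost((w1 + w2) / 2, (b1 + b2) / 2)
    avg = (cost(w1, b1) + cost(w2, b2)) / 2
    if mid > avg + 1e-12:  # small tolerance for floating point rounding
        violations += 1

print("midpoint violations:", violations)
```

Running the same kind of check on the cost of a network with a hidden layer would generally turn up violations, which is what non-convex means in practice.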

One high-level point to make is that convergence (even to a local minimum) is never guaranteed: if you pick a learning rate that is too high, you can get oscillation or even divergence. But that assumes you are using a fixed learning rate algorithm like the one Prof Ng has shown us here. There are more sophisticated versions of Gradient Descent that use adaptive techniques to control the learning rate.
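To make the learning rate point concrete, here is a minimal sketch of fixed learning rate Gradient Descent on the toy cost J(w) = w², which I picked only because the arithmetic is easy to follow; it is not the cost function from the course. The function name `gradient_descent` and the three learning rates are just illustrative choices.

```python
def gradient_descent(learning_rate, w0=1.0, steps=20):
    """Run fixed-learning-rate gradient descent on J(w) = w**2."""
    w = w0
    history = [w]
    for _ in range(steps):
        grad = 2.0 * w                 # dJ/dw for J(w) = w**2
        w = w - learning_rate * grad   # plain gradient descent update
        history.append(w)
    return history

small = gradient_descent(0.1)   # each step multiplies w by 0.8: converges
edge = gradient_descent(1.0)    # each step multiplies w by -1: oscillates
large = gradient_descent(1.1)   # each step multiplies w by -1.2: diverges

for name, hist in [("0.1", small), ("1.0", edge), ("1.1", large)]:
    print(f"lr={name}: final |w| = {abs(hist[-1]):.4f}")
```

On this toy cost each update multiplies w by (1 - 2 × learning rate), so whether you converge, bounce back and forth forever, or blow up depends only on the size of that factor. Real cost surfaces are messier, but the same three behaviours show up.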

But assuming you get convergence, you are correct that you may be at a local minimum. There is actually no practical way to tell whether the local minimum you found is close to the global minimum or not. This is only the first course in the series and there is a lot to cover, so Prof Ng does not go into much detail about these issues here, but he will say more later. Here he just mentions that the “local minimum” problem turns out not to be that big a problem in general. The mathematics gets pretty deep and is beyond the scope of this course, but here is a paper from Yann LeCun’s group which explains some of the mathematics showing that sufficiently complex neural networks have reasonable solutions even though their loss surfaces are extremely complicated and non-convex.

The reason it is difficult to tell whether a local minimum is the global minimum is that the number of solutions that yield local minima is extremely large. Here is a thread which discusses “weight space symmetry”, which is one way to see that the solution space is extremely large. But the bigger point is that actually finding the global minimum is probably not what you want in any case, since it would probably represent extreme overfitting on the training set. The other important point is that we don’t actually use the cost value J to assess the performance of a trained model: we use the prediction accuracy of the model on the training, cross validation and test datasets. Of course the cost function is critical, in that its gradients drive the back propagation process. So the function matters, but the actual J value is not really useful for anything other than as an easy proxy for whether convergence is working. The J values are also not “portable”: the J value of one network is not directly comparable to that of another network, and the J value by itself doesn’t really tell you anything about prediction accuracy.
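One form of the weight space symmetry mentioned above is easy to see in code: if you reorder the hidden units of a network (moving each unit's incoming weights, bias and outgoing weight together), you land on a different point in weight space that computes exactly the same function, and so has exactly the same J. The sketch below assumes a tiny network with one tanh hidden layer; the weights, the input and the `forward` helper are all hypothetical values chosen for illustration.

```python
import math
from itertools import permutations

def forward(x, W1, b1, W2, b2):
    """One hidden tanh layer followed by a single linear output unit."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(W2, hidden)) + b2

# Hypothetical weights for a 2-input, 3-hidden-unit network.
W1 = [[0.5, -1.2], [0.3, 0.8], [-0.7, 0.1]]
b1 = [0.1, -0.2, 0.05]
W2 = [1.5, -0.4, 0.9]
b2 = 0.2
x = [0.6, -0.3]

baseline = forward(x, W1, b1, W2, b2)

# Reorder the hidden units in every possible way. Each reordering is a
# different set of weight matrices but the same overall function.
outputs = []
for order in permutations(range(len(b1))):
    W1_p = [W1[i] for i in order]
    b1_p = [b1[i] for i in order]
    W2_p = [W2[i] for i in order]
    outputs.append(forward(x, W1_p, b1_p, W2_p, b2))

# Compare with a tolerance, since summing in a different order can
# change the last few floating point digits.
all_match = all(math.isclose(out, baseline, rel_tol=1e-12) for out in outputs)
print(len(outputs), "reorderings, all equivalent:", all_match)
```

With n hidden units in a layer there are n! such reorderings, so even a single layer of 100 units gives an astronomically large number of equivalent solutions for every minimum you find.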

But the overall point is that the Yann LeCun paper shows that this is not really a problem for most of the deep networks we will use in practice.