Local Optima with Gradient Descent

In the Week 2 lecture regarding local optima, Dr. Ng says that at points where the derivative is 0, the function can be concave or convex.

I assume that he’s referring to a 2D cross-section when he says this, and if so, why can’t that cross-section have the point as a point of inflection (so the function would be convex on one side of the point and concave on the other)? Thanks in advance!

You’re right that there are more possibilities than just concave or convex; it can also be “none of the above”. But there are two levels to the point here: 1) we really only want to find local optima, where the surface is locally convex, and 2) these are all “local” optima that we are finding, meaning there is no way to guarantee that the point Gradient Descent converges to is the global minimum.

There is a lot of deep math here that Prof Ng doesn’t have time to cover, and he has specifically designed these courses to be accessible without requiring a grad-school-level math background. He does comment that in “real life” situations, the problem of finding only local minima turns out not to be that serious an issue.
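For the “none of the above” case, a quick 1-D sanity check (my own sketch, not from the lecture) is f(x) = x³: the derivative vanishes at x = 0, yet that point is neither a local minimum nor a local maximum — it is a stationary point of inflection, exactly the situation the question asks about.

```python
# f(x) = x**3 has f'(0) = 0, but x = 0 is not an optimum:
# f keeps increasing through 0, and the concavity flips there.

def f(x):
    return x ** 3

def df(x):
    return 3 * x ** 2      # first derivative of x**3

def d2f(x):
    return 6 * x           # second derivative of x**3

assert df(0.0) == 0.0                 # stationary point at the origin
assert f(-0.1) < f(0.0) < f(0.1)      # increasing on both sides: no optimum
assert d2f(-0.1) < 0 < d2f(0.1)       # concave left of 0, convex right of 0
```

So a zero derivative by itself tells you nothing about whether you are at a minimum, a maximum, or neither.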

The reasons for this also involve math beyond the scope here, but a couple of things can be said that may help, even without getting into the actual math:

  1. The global-versus-local-minimum issue is really not a problem: even if we could find the true global minimum of the cost on the training set, that would probably not be a good solution, because it would represent very serious “overfitting” of the training data.

  2. There has been some really interesting work from Prof Yann LeCun’s group showing that, for networks with a large enough number of parameters, the local minima are constrained to a fairly narrow band of cost values that actually represents pretty good solutions. Here’s the paper. I don’t claim to have read and understood all of it, but you can get the gist just from the abstract.
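The “which local minimum you get depends on initialization” half of this can be seen in a toy example. Here is a small sketch (entirely my own, not from the course or the paper): plain gradient descent on a one-dimensional double-well cost with two minima of different depths. Starting points in different wells converge to different stationary points with different costs.

```python
# Double-well cost: f(x) = (x**2 - 1)**2 + 0.3*x
# has a local minimum near x = +1 and a deeper (global) minimum
# near x = -1. Gradient descent finds whichever is "downhill"
# from the starting point.

def f(x):
    return (x**2 - 1)**2 + 0.3 * x

def df(x):
    return 4 * x * (x**2 - 1) + 0.3   # derivative of f

def gradient_descent(x, lr=0.05, steps=500):
    for _ in range(steps):
        x -= lr * df(x)
    return x

x_right = gradient_descent(0.5)    # starts in the right-hand well
x_left = gradient_descent(-0.5)    # starts in the left-hand well

# Both end points are stationary (derivative ~ 0), but they are
# different minima, and the left one has the lower cost:
print(round(x_right, 3), round(f(x_right), 3))
print(round(x_left, 3), round(f(x_left), 3))
```

Neither run “knows” about the other well: each just follows the local slope to a zero-derivative point, which is exactly why Gradient Descent alone cannot certify a global minimum.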