Doubts about local minima in Gradient Descent

That’s easy to say, but the question is how you actually implement it. As Gent points out, we frequently have thousands or even millions of individual parameters, each of which is a real number. That means we’ve got an enormous number of choices for each of those numbers (on the order of 2^{64} distinct values if we’re using 64-bit floating point). So exactly how do you go about figuring out what the minimum possible value of J actually is in a case like that?
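Just to put some rough numbers on that, here’s a back-of-the-envelope sketch. The parameter count (1 million) is purely illustrative, and I’m ignoring details like NaNs and duplicate float patterns:

```python
# Rough back-of-the-envelope: why brute-force search over parameters is hopeless.
# Assumption (illustrative only): 1 million parameters, each a 64-bit float.
n_params = 1_000_000
bit_patterns_per_param = 2 ** 64          # distinct 64-bit patterns per parameter

# Total candidate parameter vectors = (2^64)^n_params = 2^(64 * n_params)
total_candidates_log2 = 64 * n_params     # exponent is 64,000,000
print(f"Candidate parameter vectors: 2^{total_candidates_log2}")
# For comparison, the number of atoms in the observable universe is roughly 2^266.
```

So exhaustively checking every possible setting of the parameters is completely out of the question.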

A lot of smart mathematicians have been thinking about this general problem for a long time, and the best they’ve come up with so far is basically Gradient Descent. It’s the simplest member of the broader family of gradient-based (first-order) optimization methods, which also includes more elaborate techniques such as Conjugate Gradient methods. We start by learning how to implement the simplest form of that here in DLS C1. Then we learn some more sophisticated techniques that can help with convergence in DLS C2.
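Just to make the core idea concrete, here’s a minimal sketch of the basic update in Python/numpy. The cost here is a toy quadratic, not the actual logistic-regression cost from the course, and the names (`J`, `grad_J`, `learning_rate`) are just placeholders I made up for illustration:

```python
import numpy as np

def J(w):
    # Toy convex cost: J(w) = ||w - 3||^2, minimized at w = [3, 3, ..., 3]
    return np.sum((w - 3.0) ** 2)

def grad_J(w):
    # Gradient of the toy cost above: dJ/dw = 2 * (w - 3)
    return 2.0 * (w - 3.0)

def gradient_descent(w0, learning_rate=0.1, num_iterations=100):
    w = w0.copy()
    for _ in range(num_iterations):
        w = w - learning_rate * grad_J(w)   # the core update: step against the gradient
    return w

w_init = np.zeros(5)                        # start from an arbitrary point
w_final = gradient_descent(w_init)
print("w after descent:", w_final)          # should be close to 3.0 in every coordinate
print("J at the end:", J(w_final))          # should be close to 0
```

Instead of searching the whole space, we just start somewhere and repeatedly take a small step in the direction that locally decreases the cost. The real cost functions in the course are much more complicated, but the update rule has exactly this shape.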

There are also more levels of subtlety here: e.g. it’s not clear you actually want the global minimum of the cost, because that would likely represent very extreme “overfitting” on the training data. Here’s a thread that talks about these issues a bit more and gives references to some papers. If what’s said there doesn’t make sense right now, please “hold that thought” and listen to what Professor Ng has to tell us in DLS C1 - C5.
