Hi team, I have enrolled in the Machine Learning Specialization on Coursera. I have a doubt about the gradient descent topic: it is mentioned that gradient descent finds a local minimum, not necessarily the lowest value of J(W). Instead of using gradient descent, wouldn't just looking at the lowest value of J(W) and taking its corresponding W give us the most optimal W with the lowest error?
In a 2-dimensional model (using linear regression) it's very easy to plot and find the minimum (probably), but in neural networks with many dimensions (even thousands), visualization and analysis are unimaginable. That is why you cannot really find the global minimum unless you "get lucky".
That's easy to say, but the question is how you actually implement it. As Gent points out, we frequently have thousands or even millions of individual parameters, each of which is a real number, meaning we have a very large number of choices for each of those numbers (2^64 choices if we're using 64-bit floating point). So how exactly do you go about figuring out what the minimum possible value of J actually is in a case like that?
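Just to put a rough number on it (this is my own back-of-the-envelope sketch, not anything from the course), here is what an exhaustive search would cost even with a laughably coarse grid per parameter:

```python
import math

n_params = 10_000        # even a small neural network has this many weights
values_per_param = 100   # a very coarse grid, far coarser than float64 allows

# Exhaustive search needs values_per_param ** n_params evaluations of J.
# Work with the log to avoid an astronomically large integer:
log10_combinations = n_params * math.log10(values_per_param)
print(f"grid points to evaluate: about 10^{log10_combinations:.0f}")
# -> about 10^20000, versus roughly 10^80 atoms in the observable universe
```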
A lot of smart mathematicians have been thinking about this general problem for a long time, and the best they've come up with so far is basically Gradient Descent. The general term for that type of algorithm is gradient-based optimization methods (Conjugate Gradient Methods are one well-known family of them). We start by learning how to implement the simplest form here in DLS C1. Then we learn some more sophisticated techniques that can help with convergence in DLS C2.
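To make that concrete, here is a minimal sketch of the simplest form (plain batch gradient descent on a linear regression cost). The function name, learning rate, and iteration count are my own illustrative choices, not the course's notation:

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, n_iters=1000):
    """Plain batch gradient descent on the cost J(w, b) = (1/2m) * sum(err^2)."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(n_iters):
        y_hat = X @ w + b            # current predictions
        err = y_hat - y
        dw = (1 / m) * (X.T @ err)   # dJ/dw
        db = (1 / m) * err.sum()     # dJ/db
        w -= lr * dw                 # step "downhill" along the gradient
        b -= lr * db
    return w, b

# Toy usage: recover w = 3, b = 1 from noiseless data.
X = np.linspace(0, 1, 50).reshape(-1, 1)
y = 3 * X[:, 0] + 1
w, b = gradient_descent(X, y, lr=0.5, n_iters=2000)
print(w, b)  # ~[3.], ~1.0
```

The key point is that each step only needs the gradient at the current W and B, so the work per step scales with the number of parameters rather than with the size of the search space.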
There are also more levels of subtlety here: e.g., it's not clear you actually want the global minimum of the cost, because that would represent very extreme "overfitting" on the training data. Here's a thread that talks about these issues a bit more and gives references to some papers. If what's said there doesn't make sense right now, please "hold that thought" and listen to what Professor Ng has to tell us in DLS C1 - C5.
Hi Gent, thanks for your explanation. I am relatively new to this field and I never thought about models with multiple dimensions. My initial thought was that we are plotting a fixed-dimension chart of the cost function J (like J vs. W, or J vs. W and B), so I felt simple code like min(J), then taking the W and B values at min(J), could have done the task. But it makes sense now. Thank you!
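For anyone reading this later, here is roughly what I had in mind (the grid bounds and data are made up for illustration). It does work with a single W to scan, which is exactly why it can't scale to thousands of dimensions:

```python
import numpy as np

x = np.linspace(0, 1, 50)
y = 3 * x                                          # pretend the true W is 3

w_grid = np.linspace(-10, 10, 2001)                # candidate values of W
J = ((w_grid[:, None] * x - y) ** 2).mean(axis=1)  # cost J for every candidate
w_best = w_grid[np.argmin(J)]                      # take W at min(J)
print(w_best)                                      # ~3.0
```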
Hi Paul, I really appreciate you taking the time to answer my doubt. I understand it now; I overlooked the possibility that the cost function could have n dimensions, and that we cannot limit the possible values of each dimension to a fixed range. Thank you!