How is the cost function always able to find the optimum? Suppose it gets stuck at a local minimum; how do we solve this problem?
Hi @Raja_Singh. At a local minimum, the function value is smaller than at nearby points. While training neural networks, we try to minimise the loss function, which gives us a real measure of how far we are from the point where the model performs its best on the given dataset. Hope this reply answers your query.
Hi, @Raja_Singh. It’s a difficult problem. Unless your cost surface is known to have a unique global minimum (e.g. simple regression with a cost function built on mean-squared-error loss), there are no guarantees unless…wait for it…you have an infinite amount of time and maybe a quantum computer.
You will gain more insight into this in Course 2, in particular with “stochastic gradient descent.” Deep Learning has many engineering aspects to it, where the quality of the solution is gauged by various performance metrics and not by whether you have attained a global optimum, which is typically unknowable.
I came from a discipline where uniqueness proofs were once critical, so I, too, had a natural discomfort with this idea. Let’s see what you think after the second course of the specialization.
It is an important question, but the answer has lots of layers to it.
For the simple case of Logistic Regression, the cost function is actually convex, so it has a single global minimum and no local minima. Once we graduate to real Neural Networks, though, that is no longer true. The cost surfaces are not convex and there can be lots of local optima.
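To make the convexity claim concrete, here is a small numerical spot check (my own toy sketch, not course code): a convex cost satisfies Jensen’s inequality, so along any segment between two weight vectors, the cost at the midpoint can never exceed the average of the costs at the endpoints. The data and weight scales below are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # 100 examples, 3 features
y = (rng.random(100) < 0.5).astype(float)  # arbitrary 0/1 labels

def cost(w):
    """Logistic-regression cross-entropy cost for weight vector w."""
    p = 1 / (1 + np.exp(-X @ w))           # sigmoid predictions
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Spot-check the convexity inequality along random segments in weight space.
for _ in range(1000):
    a, b = rng.normal(scale=0.5, size=(2, 3))
    assert cost((a + b) / 2) <= (cost(a) + cost(b)) / 2 + 1e-9
print("convexity spot check passed")
```

A numerical check like this is not a proof, of course, but it is a quick sanity test that no segment through weight space ever bulges above its chords, which is exactly what fails for a real neural network’s cost surface.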
One high-level point to make is that convergence (even to a local minimum) is never guaranteed: if you pick a learning rate that is too high, you can get oscillation or even divergence. But that assumes you are using a fixed-learning-rate algorithm like the one Prof Ng has shown us here. There are more sophisticated versions of Gradient Descent that use adaptive techniques to control the learning rate.
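Here is a minimal sketch of that oscillation/divergence behaviour (my own toy example, not course code): fixed-learning-rate gradient descent on f(w) = w², whose only minimum is at w = 0 and whose gradient is 2w.

```python
# Gradient descent on f(w) = w^2. The update is w := w - lr * 2w,
# i.e. w is multiplied by (1 - 2*lr) each step: the iterates shrink
# toward 0 when |1 - 2*lr| < 1 and blow up (with oscillating sign
# when 1 - 2*lr < 0) otherwise.
def gradient_descent(lr, w=1.0, steps=50):
    for _ in range(steps):
        w = w - lr * 2 * w  # update rule: w := w - lr * dJ/dw
    return w

print(abs(gradient_descent(lr=0.1)))  # tiny: converged toward the minimum
print(abs(gradient_descent(lr=1.1)))  # huge: overshoots more than it corrects
```

The same overshoot mechanism is what a too-large learning rate causes on a real cost surface, which is why adaptive methods that shrink the step size are useful.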
But assuming you get convergence, you are correct that you may be at a local minimum. There is actually no practical way to tell whether the local minimum you found is close to the global minimum or not. This is only the first course in the series, so there is too much to cover, and Prof Ng does not go into much detail about these issues here, but he will say more later. Here he just mentions that the “local minimum” problem actually turns out not to be that big a problem in general. The mathematics here gets pretty deep and is beyond the scope of this course, but here is a paper from Yann LeCun’s group which presents mathematics showing that sufficiently complex neural networks have reasonable solutions even though their loss surfaces are extremely complicated and non-convex.
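To see the basic local-minimum phenomenon in the smallest possible setting, here is a sketch of my own (not from the course or the paper): on a non-convex one-dimensional function like f(w) = w⁴ − 2w², gradient descent settles into whichever minimum its starting point’s basin belongs to, and nothing in the final cost tells you whether a better basin existed elsewhere.

```python
# f(w) = w^4 - 2w^2 has two local minima, at w = -1 and w = +1,
# separated by a local maximum at w = 0. Gradient descent converges
# to whichever minimum lies in the basin of the starting point.
def descend(w, lr=0.01, steps=2000):
    for _ in range(steps):
        w -= lr * (4 * w**3 - 4 * w)  # gradient of w^4 - 2w^2
    return w

print(round(descend(-0.5), 3))  # lands in the left basin, near w = -1
print(round(descend(+0.5), 3))  # lands in the right basin, near w = +1
```

Here both minima happen to have the same cost, but in general they need not, and in high dimensions there is no practical way to enumerate the basins.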
The reason it is difficult to tell whether a local minimum is the global minimum is that the number of solutions that yield local minima is extremely large. Here is a thread which discusses “weight space symmetry”, which is one way to see that the solution space is extremely large. But the bigger point is that actually finding the global minimum is probably not what you want in any case, since it would probably represent extreme overfitting on the training set.

The other important point is that we don’t actually use the cost value J to assess the performance of a trained model: we use the actual prediction accuracy of the model on the training, cross-validation and test datasets. Of course the cost function is critical in that its gradients drive the back propagation process, so the function matters, but the actual J value is not really useful for anything other than as an easy proxy for whether your convergence is working or not. Note also that J values are not “portable”: knowing the J value for one network does not make it comparable to another network, and the J value by itself doesn’t really tell you anything about the prediction accuracy.
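The weight-space-symmetry idea can be demonstrated directly. Here is a toy sketch of my own (not from the linked thread): permuting two hidden units of a small network, by swapping the corresponding rows of W1/b1 and columns of W2, produces a different point in weight space that computes exactly the same function, so every solution comes with a large family of equivalent twins.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))  # 3 inputs, 4 hidden units
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=(1, 1))  # 1 output unit

def forward(W1, b1, W2, b2, X):
    A1 = np.tanh(W1 @ X + b1)  # hidden layer activations
    return W2 @ A1 + b2        # linear output layer

X = rng.normal(size=(3, 5))    # 5 input examples

# Swap hidden units 0 and 2: reorder rows of W1/b1 and columns of W2 the same way.
perm = [2, 1, 0, 3]
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

out_a = forward(W1, b1, W2, b2, X)
out_b = forward(W1p, b1p, W2p, b2, X)
print(np.allclose(out_a, out_b))  # same function, different weight vectors
```

With h hidden units in a layer there are h! such permutations per layer (before even counting sign symmetries of tanh), which is one concrete sense in which the count of equivalent minima explodes.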
But the overall point is that the Yann LeCun paper shows that this is not really a problem for most of the deep networks we will use in practice.
Thank you so much for your answer.
Thank you so much for your response.