In the learning rate video, an example of a nonlinear function is given, showing what happens if we start at a point that is already at a local minimum. However, I was wondering: if we start at a local maximum for w, will gradient descent still work? The slope at that point will be 0, so w will remain the same and won’t move toward the local minimum. What is the solution to this?

If you are already at a local minimum, then you’re done, right? The question is just how you realize that. It’s always a decision that you have to make when to stop the iterations. One common way is to monitor whether the cost value is continuing to decrease or not. When it stabilizes and stops decreasing, then there is no point in continuing, which is what would happen if you exactly land on a point where the gradients are zero.
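Here is a minimal sketch of that stopping rule, in case it helps to see it written out. The cost function, the tolerance `tol`, and the names `cost`, `grad`, and `gradient_descent` are all just assumptions for illustration, not anything from the course code.

```python
# Toy cost J(w) = (w - 3)^2, chosen only for illustration.
def cost(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

def gradient_descent(w, alpha=0.1, tol=1e-9, max_iters=10_000):
    """Iterate until the cost changes by less than tol (a hypothetical threshold)."""
    prev = cost(w)
    for i in range(1, max_iters + 1):
        w = w - alpha * grad(w)
        cur = cost(w)
        if abs(prev - cur) < tol:
            # The cost has stabilized, so further iterations are pointless.
            return w, i
        prev = cur
    return w, max_iters

w_final, iters = gradient_descent(w=0.0)
print(w_final, iters)
```

Note that this check also fires immediately if you start exactly on a point where the gradient is zero, because the cost does not change at all on the first step.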

Of course it can also happen that you slightly overshoot and then the cost can oscillate around a low value. But if you are using too high a learning rate value, the oscillations can get larger instead of staying close to the minimum value. So you have work to do here to pick a good learning rate value and a good number of iterations.
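To make the learning rate point concrete, here is a small sketch on the toy cost J(w) = w^2. The function name `run` and the two learning rate values are my own choices for illustration. With this cost, each update multiplies w by (1 - 2 * alpha), so w flips sign on every step whenever alpha is above 0.5, and the size of the flip either shrinks or grows depending on alpha.

```python
def run(alpha, w=1.0, steps=10):
    """Return the cost after each gradient descent step on J(w) = w^2."""
    costs = []
    for _ in range(steps):
        w = w - alpha * 2.0 * w   # gradient of w^2 is 2w
        costs.append(w ** 2)
    return costs

shrinking = run(alpha=0.9)   # overshoots each step, but oscillations die out
growing = run(alpha=1.1)     # overshoots by more each step, cost blows up

print(shrinking[0], shrinking[-1])
print(growing[0], growing[-1])
```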

If you continue through MLS and eventually take DLS, you will later learn about more sophisticated algorithms for doing gradient descent that use dynamic learning rates.

Thanks for your answer! I guess you misinterpreted my question. It was: if we take our initial point to be a local maximum, then we won’t be able to reduce the cost function and reach the local minimum, as the slope is 0 at the local maximum. How should we tackle this problem?

Choose a different initial point. Start over with random initialization with a different seed value or with no seed value at all. There is never any guarantee that gradient descent is just going to work with no effort on your part. You need to analyze the results and take action when it doesn’t work.

Also note that, as a general matter, the probability of starting with a random initial point that just happens to be a local maximum or local minimum or, for that matter, a saddle point (another point with zero gradients) is pretty low.
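A quick sketch of both situations, using a made-up non-convex cost J(w) = w^4 - 2w^2, which has a local maximum at w = 0 and minima at w = -1 and w = 1. The function names, the learning rate, and the seed value are assumptions for illustration only.

```python
import random

def grad(w):
    # Derivative of the illustrative cost J(w) = w^4 - 2w^2.
    return 4.0 * w ** 3 - 4.0 * w

def descend(w, alpha=0.05, steps=500):
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

# Starting exactly on the local maximum: the gradient is 0, so w never moves.
stuck = descend(0.0)

# Starting from a random point instead: w slides down to one of the minima.
rng = random.Random(1)          # seed chosen arbitrarily
w0 = rng.uniform(-2.0, 2.0)
restarted = descend(w0)

print(stuck, w0, restarted)
```

Which of the two minima you end up in depends on which side of 0 the random starting point falls.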

At this point in the course, we’re only using simple convex cost functions. They’re shaped like parabolas with positive 2nd derivatives.

So they have only one minimum, and no local maxima.
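As a one-variable illustration of why that holds, take a generic parabola (the symbols a, b, c here are just placeholders, not the course notation):

```latex
J(w) = a w^2 + b w + c, \quad a > 0
\\
\frac{dJ}{dw} = 2 a w + b = 0 \;\Longrightarrow\; w = -\frac{b}{2a}
\\
\frac{d^2 J}{dw^2} = 2a > 0
```

The slope is zero at exactly one point, and the positive second derivative means that point is a minimum, so there is no local maximum to get stuck on.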

Got it, thanks!