Gradient Descent - Please explain how it knows which direction to take from the top of the hill

I am revising this section after completing Parts I and II, and I have roughly understood that the MSE cost function has the shape of an inverted bell curve. I understand how the gradient descent formula can find its minimum.

However, on a 3D hill, how does the gradient descent formula know which of the 360 degrees of directions to take? From what I could visualize, it just picks some direction and keeps going down.

Can you please explain this section of the video, from 4:58 - 5:24?

Thank You,
Venkat S

Time: 4:58 - 5:24
Gradient descent | Coursera

Your goal is to start up here and get to the bottom of one of these valleys as efficiently as possible. What the gradient descent algorithm does is, you’re going to spin around 360 degrees and look around and ask yourself, if I were to take a tiny little baby step in one direction, and I want to go downhill as quickly as possible to one of these valleys, what direction do I choose to take that baby step? Well, if you want to walk down this hill as efficiently as possible, it turns out that if you’re standing at this point in the hill and you look around, you will notice that the best direction to take your next step downhill is roughly that direction. Mathematically, this is the direction of steepest descent. It means that when you take a tiny little baby step, this takes you downhill faster than a tiny little baby step you could have taken in any other direction. After taking this first step, you’re now at this point on the hill over here.


Gradient descent works best on convex cost functions. This means the cost surface isn’t a hill at all; it’s a bowl that extends upward without bound. No matter where you start, the gradients always point toward the bottom.

If you’re using a non-convex cost function, then there can exist a local maximum (the “top of a hill”). If you’re perfectly and exactly at the top of the hill, the gradients will all be zero, and the solution will never move off that point.

In this case, the best solution is to pick a different initial weight value, and hope it isn’t also perfectly on the top of a different hill.

In practice, this situation is vanishingly rare, and not really a consideration to worry about.
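A quick way to see the “bowl” behavior is to run gradient descent on a simple convex function. This is just an illustrative sketch (the function and names below are invented, not from the course):

```python
# Illustrative sketch (not from the course): gradient descent on a
# convex "bowl", J(w, b) = (w - 3)**2 + (b + 1)**2, minimum at (3, -1).

def grad(w, b):
    # Partial derivatives of J with respect to w and b.
    return 2 * (w - 3), 2 * (b + 1)

def descend(w, b, lr=0.1, steps=200):
    # Repeatedly step against the gradient.
    for _ in range(steps):
        dw, db = grad(w, b)
        w, b = w - lr * dw, b - lr * db
    return w, b

# Two very different starting points both end up near the bottom (3, -1).
print(descend(100.0, -50.0))
print(descend(-7.0, 9.0))
```

From either start, the negative gradient points toward the bottom of the bowl, which is exactly the “no matter where you start” property of a convex cost.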


Hi TMosh,
Thanks for your prompt response. I am with you; I get what you are saying.

Below is the part I am confused about. Andrew mentioned earlier that gradient descent can reduce the cost for any function, including MSE. In the golf-course hill example, though, is he still talking about MSE?

If so, then how does MSE know about direction?
Andrew mentions this at 5:10:
" Mathematically, this is the direction of steepest descent. It means that when you take a tiny baby little step, this takes you downhill faster than a tiny little baby step you could have taken in any other direction."


The gradients are given by the partial derivatives of the cost equation with respect to each weight.

If the equation is convex (as it is for MSE - used for linear regression - or the logistic regression cost equation), then the gradients lead toward the minimum cost.

If the cost equation is not convex, then the gradients will still lead toward a minimum, but it may not be a global minimum.

Andrew’s explanation tries to minimize the need to understand calculus.
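For anyone who wants to see those gradients concretely, here is a sketch of the MSE partial derivatives for a single feature; the data and variable names are my own invention, not from the course:

```python
# Hedged sketch: the MSE gradients for one feature, written out.
# For J(w, b) = (1 / (2 * m)) * sum((w * x + b - y)**2), the partials are
#   dJ/dw = (1 / m) * sum((w * x + b - y) * x)
#   dJ/db = (1 / m) * sum(w * x + b - y)

def mse_gradients(w, b, xs, ys):
    m = len(xs)
    dj_dw = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / m
    dj_db = sum((w * x + b - y) for x, y in zip(xs, ys)) / m
    return dj_dw, dj_db

# Fit y = 2x + 1; the convex MSE surface guides (w, b) toward the true
# values from any starting point.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = -4.0, 10.0
for _ in range(5000):
    dj_dw, dj_db = mse_gradients(w, b, xs, ys)
    w, b = w - 0.05 * dj_dw, b - 0.05 * dj_db
print(round(w, 3), round(b, 3))  # close to 2 and 1
```

The pair (dJ/dw, dJ/db) is what gives the “direction”: stepping against it is the steepest-descent direction on the cost surface.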

Ah, it’s the partial derivative, so it will be a bell curve for that variable.
Am I correct?
Thank you so much TMosh

I have an additional question related to the OP’s initial question.

I have a non-convex cost function, that is, a function with many local minima and one global minimum.

How can I choose the initial values of W and b such that I reach the global minimum? My goal is to avoid choosing W and b that get stuck in a local minimum. Is there a way to do this, or is the choice of initial w and b random, so we just have to trust our luck?

Or should we say that in the case of non-convex functions, the gradient descent algorithm is not the best approach?

You can’t pick the best initial values.

But you can train with several different initial values, and choose the one that gives the lowest cost.

Or you can accept any minimum for which the test set cost is good enough.
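A sketch of the several-initial-values idea; the non-convex cost below is invented for illustration (it has a local minimum near w = 2.1 and a global minimum near w = -2.35):

```python
import random

def cost(w):
    # An invented 1-D non-convex cost: local minimum near w = 2.1,
    # global minimum near w = -2.35.
    return 0.1 * w**4 - w**2 + 0.5 * w

def grad(w, eps=1e-6):
    # Numerical derivative, to keep the sketch short.
    return (cost(w + eps) - cost(w - eps)) / (2 * eps)

def descend(w, lr=0.01, steps=2000):
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# Run gradient descent from several random starting points and keep
# the result with the lowest cost.
random.seed(0)
runs = [descend(random.uniform(-5.0, 5.0)) for _ in range(10)]
best = min(runs, key=cost)
print(best, cost(best))
```

Runs that start in the right-hand basin settle into the local minimum; picking the lowest-cost result across restarts recovers the global one.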

Yes that makes sense. Thank you for the answer.

The partial derivative tells you how fast the cost changes with respect to that variable.
And the sign in front of the partial derivative determines whether you go downhill or uphill.
In gradient descent, for example w = w - learning_rate * partial_derivative_of_w, the minus sign means go downhill.
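A tiny check of that update rule, using an invented cost J(w) = w² and a made-up learning rate:

```python
def J(w):
    # Invented convex cost for illustration.
    return w ** 2

def dJ_dw(w):
    # Its derivative.
    return 2 * w

lr = 0.25
for w in (4.0, -4.0):          # gradient is +8 at w = 4, -8 at w = -4
    w_new = w - lr * dJ_dw(w)  # minus sign: step against the gradient
    print(w, "->", w_new, "cost:", J(w), "->", J(w_new))
# At w = 4 the gradient is positive, so w decreases (4 -> 2);
# at w = -4 the gradient is negative, so w increases (-4 -> -2).
# Either way the cost drops (16 -> 4): the update moves downhill
# regardless of the gradient's sign.
```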

How does your definition work when the gradients are negative?

The direction, downhill or uphill, has nothing to do with whether the gradients are positive or negative. To be more precise, I should use the term “direction of decrease” to refer to the minus sign in the formula w = w - learning_rate * partial_derivative_of_w.
The gradient descent algorithm assumes the initial value of w is not already at a local minimum, where the gradient is close to zero. Because the step is taken against the sign of the partial derivative, the minus sign always moves w in the direction that decreases the cost.

The important factor is the sign of the change in the weights (driven by dJ/dw) that will make the cost decrease.

Yes, that’s true.