Week 2 : Supervised Machine Learning: Regression and Classification

I have managed to complete week 2 but I am totally confused.

Why do we need to compute the gradient? The simple version would be

While loop →
Step A : Calculate cost for starting w and b value.
Step B : Calculate cost for starting w+alpha and b+alpha value.
Compare output between Step A and Step B and then decide whether to continue processing

Can someone please explain why gradient descent is needed? I see that we are multiplying alpha by the gradient and then subtracting the result from w and b, like below

w = w - alpha * dj_dw
b = b - alpha * dj_db

I guess my question is why not

w = w - alpha
b = b - alpha


The gradients give the direction and magnitude that aims the cost “downhill” toward the minimum.
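To make "direction and magnitude" concrete, here is a minimal sketch of how dj_dw and dj_db could be computed for a line w*x + b. It assumes the squared-error cost J(w, b) = (1 / (2m)) * sum((w*x_i + b - y_i)^2); the function name and the two data points are made up for illustration.

```python
def compute_gradient(x, y, w, b):
    """Gradient of J(w, b) = (1 / (2m)) * sum((w*x_i + b - y_i)^2)."""
    m = len(x)
    dj_dw = 0.0
    dj_db = 0.0
    for x_i, y_i in zip(x, y):
        error = (w * x_i + b) - y_i   # prediction minus target
        dj_dw += error * x_i
        dj_db += error
    return dj_dw / m, dj_db / m

# Made-up data: the line that fits both points exactly is w = 200, b = 100.
x_train = [1.0, 2.0]
y_train = [300.0, 500.0]

# Starting below the best w: both gradients are negative,
# so "w - alpha * dj_dw" increases w.
grad_low = compute_gradient(x_train, y_train, 0.0, 0.0)
print(grad_low)    # (-650.0, -400.0)

# Starting above the best w: both gradients are positive,
# so the same update rule now decreases w.
grad_high = compute_gradient(x_train, y_train, 400.0, 0.0)
print(grad_high)   # (350.0, 200.0)
```

The sign tells the update which way to move, and the size tells it how far, which a bare "minus alpha" cannot do.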

It is a good question. I just finished the first course and can’t stop appreciating the idea behind gradient descent. Simple yet so powerful I feel!

@TMosh nailed it. Just to add to that: with only the learning rate alpha, you might overshoot (when alpha is too big) or take a very long time to converge to the lowest point (when alpha is too small), because you are always deducting a constant from the parameters. When you multiply by the derivative, the size of the step follows the ‘rate of change’ of the cost, so the algorithm takes smaller steps as it nears the minimum.
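A small sketch of that shrinking step size, using a toy one-parameter cost J(w) = (w - 3)^2 rather than the course's cost function (the toy cost, the starting point, and alpha = 0.1 are all assumptions picked for illustration):

```python
def dj_dw(w):
    """Derivative of the toy cost J(w) = (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)

alpha = 0.1
w = 0.0
step_sizes = []

for _ in range(50):
    step = alpha * dj_dw(w)      # step is large far from the minimum, small near it
    w = w - step
    step_sizes.append(abs(step))

print(step_sizes[:5])   # each step is smaller than the one before
print(w)                # close to 3
```

With a constant step of alpha, every entry in step_sizes would be the same number, no matter how close w already is to the minimum.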

You might have seen the size of steps get smaller as it approaches the minimum in the video and the optional labs. Attaching a screenshot as well.


On top of @darshN’s excellent example, the derivative also takes care of the direction. In the graph shared by @darshN, we see that b increased at first and then it decreased. This is not possible with just “minus alpha”.

Similarly, with just “minus alpha”, we expect w and b to only move in one direction, but gradient descent is supposed to work regardless of where w and b started off, so it can’t be a fixed direction, and the derivative saves us on that.
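Here is a rough sketch of the direction point, again on a made-up toy cost J(w) = (w - 3)^2. The function names and starting points are hypothetical; the only thing being compared is "w - alpha * derivative" against "w - alpha".

```python
def dj_dw(w):
    """Derivative of the toy cost J(w) = (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)

def run_gradient_descent(w, alpha=0.1, steps=50):
    for _ in range(steps):
        w = w - alpha * dj_dw(w)   # sign of the derivative picks the direction
    return w

def run_fixed_step(w, alpha=0.1, steps=50):
    for _ in range(steps):
        w = w - alpha              # always moves left, whatever the cost looks like
    return w

# Start on either side of the minimum.
gd_from_left = run_gradient_descent(0.0)
gd_from_right = run_gradient_descent(10.0)
fixed_from_left = run_fixed_step(0.0)
fixed_from_right = run_fixed_step(10.0)

print(gd_from_left, gd_from_right)        # both end up close to 3
print(fixed_from_left, fixed_from_right)  # both just drifted left by about 5
```

Starting from 0, the fixed rule walks away from the minimum from the very first step; starting from 10, it heads the right way only by luck and would walk straight past 3 if it kept going.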



For the beginners like me if you are wondering what @TMosh is saying then…

It took me quite some time to understand this intuitively, and it is quite simple. Any function, mx+b or ax^2+bx+c or anything else, is a straight or curved line, and you want to select a point on the line to calculate the cost. If you do not calculate the gradient, then the cost will be calculated at points further and further away from the line (which is what happens when alpha is too big).

I am not sure how to intuitively understand that in case of multiple variables but I am going to keep on thinking in the same direction.


Hello @Harshawardhan_Deshpa,

Would you like to focus on the case of mx +b and step through the details of your idea?



You need to calculate the cost at each point in the equation. For that to happen, you need to start at some point and then move up or down. I still don't know how the initial starting point is calculated, but once you have it, the way you do it is initial starting point - (alpha * slope). Alpha * slope is there to make sure that you are staying on the line while walking up or down. Imagine this visually and you will get it.


The initial weights are usually set to zero.

The slope comes from the gradients.

In general, follow the instructions in the notebook, and add your code where the “YOUR CODE HERE” banner appears within a code cell.

Thanks @Harshawardhan_Deshpa, and I agree with you! Btw, we usually initialize the starting point randomly, or zeros in some specific cases.


The reason why, based on my understanding, is to find the lowest value of the cost function J(w,b), which will give you the line of best fit.

It’s a mathematical tool used to help you find the best bias and weights of a function which then does the thing (thing being your desired output)


Thanks for sharing, @Ryan_A, and it is a good way to understand it! We are setting it up as an “optimization problem” and the optimization is achieved by minimizing the cost.
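Putting the thread together, a minimal sketch of that optimization for a line w*x + b might look like the following. It assumes the squared-error cost J(w, b) = (1 / (2m)) * sum((w*x_i + b - y_i)^2), zero initial values, and made-up data, alpha, and iteration count; the function names are illustrative, not taken from the lab.

```python
def compute_cost(x, y, w, b):
    """Squared-error cost J(w, b) = (1 / (2m)) * sum((w*x_i + b - y_i)^2)."""
    m = len(x)
    return sum(((w * x_i + b) - y_i) ** 2 for x_i, y_i in zip(x, y)) / (2 * m)

def compute_gradient(x, y, w, b):
    """Partial derivatives of J with respect to w and b."""
    m = len(x)
    dj_dw = sum(((w * x_i + b) - y_i) * x_i for x_i, y_i in zip(x, y)) / m
    dj_db = sum(((w * x_i + b) - y_i) for x_i, y_i in zip(x, y)) / m
    return dj_dw, dj_db

def gradient_descent(x, y, w, b, alpha, num_iters):
    cost_history = []
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(x, y, w, b)
        # Update both parameters from gradients computed at the same (w, b).
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
        cost_history.append(compute_cost(x, y, w, b))
    return w, b, cost_history

# Made-up data: the line that fits both points exactly is w = 200, b = 100.
x_train = [1.0, 2.0]
y_train = [300.0, 500.0]

w_final, b_final, cost_history = gradient_descent(
    x_train, y_train, w=0.0, b=0.0, alpha=0.1, num_iters=10000
)
print(w_final, b_final)                   # close to 200 and 100
print(cost_history[0], cost_history[-1])  # the cost goes down over the run
```

Minimizing the cost is the whole job here: the loop never "knows" the answer is w = 200, b = 100, it only keeps stepping downhill on J until the steps become negligible.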