Cost Function and Gradient Descent

I wasn’t completely sure if I understood the Linear Regression lecture correctly. Could anyone correct me if I am wrong, or could you please answer 4 and 5?

  1. In order to find the Linear Regression model, you adjust the values of w and b while seeking the function with the least error rate.

  2. This error is calculated using a Cost Function, mainly by computing the Squared Error Rate for comparison.

  3. To find the minimum J(w) value using Gradient Descent, you continuously modify the Learning Rate (alpha) and the derivatives of w and b, moving toward the minimum.

  4. (I’m not sure about this part) Do you use the Cost Function to compare the error values with each movement?

  5. At what exact point does the model training stop? Is it when the derivative values become zero?

You’re on the right track.

  1. Yup, the point of linear regression is to find the w and b values that return the smallest error/cost.

  2. Correct as well. A common cost function for linear regression is the (mean) squared error. However, different ML models use different cost functions.

  3. Mostly correct.

At this point in the course, it’s better to think of the learning rate as constant (not changing) during the entire learning process. It’s basically just a number you choose before training.

The technical way to put it is that we calculate the derivative of the error (from the Cost Function) with respect to w and b. What does this mean? To put it simply, the derivative with respect to w tells us how much the error would increase or decrease if we increased the parameter w by a small amount (and likewise for b).

For example, if the derivative of the error with respect to w is 3, this means that if we add 0.01 to w, then the error would increase roughly by 0.03. If we were to subtract 0.01 from w, then the error would decrease roughly by 0.03. Obviously, in this case, we would want to subtract from w to decrease the error.

This is just a simple example. Normally, the derivative is a more complicated equation, and so we usually change w by a smaller amount at a time to make sure we aren’t causing the error to change too dramatically.
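To make the derivative idea concrete, here is a minimal Python sketch. It assumes the halved mean squared error cost, J(w, b) = (1/2m) * sum((w*x + b - y)^2), which is a common choice in introductory courses. The function names `compute_cost` and `compute_gradient` and the single data point are made up for illustration, chosen so that the derivative with respect to w comes out as 3, as in the example above.

```python
def compute_cost(x, y, w, b):
    """Halved mean squared error for the line f(x) = w*x + b (assumed cost)."""
    m = len(x)
    return sum((w * xi + b - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)


def compute_gradient(x, y, w, b):
    """Derivatives of the cost above with respect to w and b."""
    m = len(x)
    dj_dw = sum((w * xi + b - yi) * xi for xi, yi in zip(x, y)) / m
    dj_db = sum((w * xi + b - yi) for xi, yi in zip(x, y)) / m
    return dj_dw, dj_db


# Hypothetical data: one point, picked so that dJ/dw is 3 at w = 3, b = 0.
x = [1.0]
y = [0.0]
w, b = 3.0, 0.0

dj_dw, dj_db = compute_gradient(x, y, w, b)

# Nudge w by a small amount and compare the actual change in cost
# with the change predicted by the derivative (derivative * nudge).
nudge = 0.01
actual_change = compute_cost(x, y, w + nudge, b) - compute_cost(x, y, w, b)
predicted_change = dj_dw * nudge

print("dJ/dw:", dj_dw)
print("predicted change:", predicted_change)
print("actual change:", actual_change)
```

The predicted and actual changes agree only approximately, because the derivative describes the slope at the current w and the cost curve bends as w moves. That is one reason to take small steps.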

  4. I am not sure what you mean by comparing the error values. With that said, at each step of the training (or “movement”) where the parameters w and b are adjusted, we usually save the error computed by the Cost Function so we can examine it later (by plotting a graph to make sure that the error is really going down the more training we do).

  5. The model stops training whenever we want it to!

There are various criteria for this, and you can choose which one to use depending on your situation. Some common ones include: 1) when the decrease in error per training step falls below a certain threshold, 2) after a fixed number of training steps, 3) when we judge the model to be “good enough”, or 4) when you run out of time, patience, or computing resources.

For more complex models, it’s usually quite difficult (if not impossible) to get to the point where all of the derivative values are 0.
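As a rough sketch of how points 4 and 5 might look in code, the Python below runs gradient descent with a constant learning rate, saves the cost at every step, and stops on either of the first two criteria above. The function name `gradient_descent`, the default values for `alpha`, `max_steps`, and `tol`, and the toy data are all assumptions chosen for illustration, and the cost is again assumed to be the halved mean squared error.

```python
def gradient_descent(x, y, alpha=0.1, max_steps=10_000, tol=1e-9):
    """Fit f(x) = w*x + b with a constant learning rate alpha.

    Stops after max_steps, or earlier once a single step lowers the
    cost by less than tol. All default values are arbitrary choices.
    """
    m = len(x)
    w, b = 0.0, 0.0
    cost_history = []  # saved so the cost can be inspected or plotted later

    for step in range(max_steps):
        errors = [w * xi + b - yi for xi, yi in zip(x, y)]
        cost = sum(e ** 2 for e in errors) / (2 * m)
        cost_history.append(cost)

        # Criterion 1: the latest step barely reduced the cost.
        if step > 0 and cost_history[-2] - cost < tol:
            break

        # Derivatives of the cost with respect to w and b.
        dj_dw = sum(e * xi for e, xi in zip(errors, x)) / m
        dj_db = sum(errors) / m

        # Move both parameters against their derivatives.
        w -= alpha * dj_dw
        b -= alpha * dj_db

    return w, b, cost_history


# Hypothetical data generated from y = 2x + 1.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

w, b, cost_history = gradient_descent(x, y)
print("w:", w, "b:", b)
print("steps run:", len(cost_history), "final cost:", cost_history[-1])
```

Note that the loop ends because the cost has stopped improving by a meaningful amount, not because the derivatives have reached zero. Plotting `cost_history` against the step number is the usual way to check that the cost really is going down.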

Hi @g471000,

I think @hackyon has explained everything. If you are still unsure about your fourth question (the one about comparing error values), please elaborate on it. For example, what gave you the impression that any comparison is involved?

Btw, although you began by saying your questions were about the linear regression lectures, I have a feeling that you have already gone beyond that. If you have read anything else that contributed to your questions, please feel free to share it with us too, as it can help the discussion.



On point 1: not exactly. We find the w and b values that give the minimum cost. The term “error rate” doesn’t really apply to a linear (real number) output.

On point 3: no, the learning rate is not continuously modified.

On point 4: you may monitor the cost value to verify that the cost is decreasing, but this isn’t strictly necessary.

On point 5: in practice, the derivatives will essentially never reach exactly zero. You might stop training when the cost is no longer decreasing significantly. There is no single threshold for this; choosing one is something you learn through experience.