Gradient Descent confusion

Hi everyone, I’ve recently started learning about ML because we have an ML class at college, where I’m studying computer science.

So I think I understand the GD algorithm and the concept of the cost function (MSE) pretty well, but I’ve come across two different versions of each equation, which left me kind of confused about what they mean.

image

So there’s this one here, which was the first one I came across and which I understand.

But now there’s this one here as well.

Since each one of them results in a different derivation, I can’t quite grasp the difference between them.

Can anyone please help me clarify this?

Hi @Mahmoud_Mohamed4,

The second diagram does not define the terms used in the formula, and I cannot find it in the lecture video, so I assume this is what it means:

E represents the error, so the formula calculates the mean squared error, where n is the number of examples.
y_i is the true label.
(mx_i + c) is the prediction. It plays the same role as h_θ(x_i) in the first formula, with m as the weight. In this formula the bias term is written explicitly as c, whereas in the first formula the bias term could be the first element of the θ vector, if used. The footnote tells us that the term (mx_i + c) is ŷ_i. Squaring the resulting difference takes care of any negative values.

So basically, the two formulae are doing the same thing: calculating the cost, i.e., the error.
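
For reference, here is a sketch of how the two formulas presumably look side by side. It is reconstructed from the descriptions in this thread rather than copied from the original images, so the exact notation (the name J, the index ranges, the order of the terms inside the square) is an assumption:

$$
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x_i) - y_i\right)^2
\qquad\text{and}\qquad
E = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (m x_i + c)\right)^2
$$

If this reading is right, note that m means different things in the two versions: in the first it counts the examples (like n in the second), while in the second it is the slope of the line.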

Yeah, I get you.
But why is the 1st equation multiplied by 1/(2m), while the 2nd is multiplied by 1/n?
Does it have any effect on the results at all?

Hi @Mahmoud_Mohamed4,

If you go back to the lecture video "Cost function formula", the Prof did mention that, and he said he would explain it later on. If you think of the cost as an indication of how well the model is finding the values for the parameters W and b (weights and bias), then it is the downward trend that matters more than the absolute value.

In which video was that part mentioned, please?

At timestamp 7:07

Cost function formula | Coursera

Adding to @Kic’s clear answer, I would add that using either 1/n or 1/(2n) is valid and is a choice you can make. This factor basically scales the cost by the number of samples.

The important thing is to be consistent: whichever of 1/n or 1/(2n) you pick, use it across the entire model.
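
To illustrate, here is a minimal sketch in plain Python. It assumes the straight-line model ŷ = m·x + c discussed above; the function name `fit_line`, the toy data, the learning rate, and the step count are all made up for the example. It runs gradient descent twice, once with 1/n and once with 1/(2n), and both runs should end up at essentially the same slope and intercept:

```python
def fit_line(xs, ys, scale, lr=0.05, steps=5000):
    """Gradient descent on E = scale * sum((y_i - (m*x_i + c))**2).

    `scale` is the constant in front of the sum, e.g. 1/n or 1/(2n).
    This helper is hypothetical, written only for this illustration.
    """
    m, c = 0.0, 0.0
    for _ in range(steps):
        # Residuals y_i - (m*x_i + c) for the current parameters.
        residuals = [y - (m * x + c) for x, y in zip(xs, ys)]
        # Partial derivatives of E with respect to m and c.
        grad_m = -2.0 * scale * sum(r * x for r, x in zip(residuals, xs))
        grad_c = -2.0 * scale * sum(residuals)
        m -= lr * grad_m
        c -= lr * grad_c
    return m, c


# Toy data lying exactly on y = 2x + 1 (an assumption for the demo).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
n = len(xs)

m_a, c_a = fit_line(xs, ys, scale=1.0 / n)
m_b, c_b = fit_line(xs, ys, scale=1.0 / (2 * n))

print("1/n     :", m_a, c_a)
print("1/(2n)  :", m_b, c_b)
```

The constant only rescales the gradient, so it behaves like a change in the learning rate: the path taken differs slightly, but the minimum being approached is the same.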

@Juan_Olano and @Kic
Thank you guys for the help.

As Kic and Juan have explained, which constant factor you choose doesn’t really matter in terms of the final solution you get: if you minimize E, you have also minimized \frac{1}{2}E, and the other way around. The one other thing worth mentioning is why a lot of people prefer to add the factor of \frac{1}{2} here: the very next step will be to take the derivative of E in order to compute the gradients for back propagation. Notice that the error terms are squared there, so taking the derivative will give you a factor of 2, right? If we have:

f(x) = x^2

Then

f'(x) = 2x

So the \frac{1}{2} will cancel that factor of 2 and just make the formulas for the gradients a bit simpler and cleaner. As mentioned above, it gives the same final answer either way, so why not optimize for the simpler gradient formulas? Those are what we actually use when writing the code.
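
To make the cancellation concrete, here is a sketch of the derivative with respect to the slope, assuming the linear prediction \hat{y}_i = m x_i + c discussed earlier in the thread and n examples. With the factor 1/n the 2 stays in the result:

$$
\frac{\partial}{\partial m}\left[\frac{1}{n}\sum_{i=1}^{n}\left(y_i - (m x_i + c)\right)^2\right]
= -\frac{2}{n}\sum_{i=1}^{n} x_i\left(y_i - (m x_i + c)\right)
$$

With the factor 1/(2n) it cancels:

$$
\frac{\partial}{\partial m}\left[\frac{1}{2n}\sum_{i=1}^{n}\left(y_i - (m x_i + c)\right)^2\right]
= -\frac{1}{n}\sum_{i=1}^{n} x_i\left(y_i - (m x_i + c)\right)
$$

The two gradients differ only by the constant 2, so they point in the same direction and vanish at the same values of m and c.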

Thank you for the clear explanation