Hi everyone, I’ve recently started learning about ML, since we have an ML class at college (I’m studying computer science).
I think I understand the GD algorithm and the concept of the cost function (MSE) pretty well, but I’ve come across two different definitions of each equation, which has me somewhat confused about what that means.
So there’s this one here, which was the first one I came across and understand.
As the second diagram has no reference to the terms used in the formula, and I cannot find it in the lecture video, I assume this is what it means:
E represents error, so here the formula calculates the mean error value, where n is the number of examples, y_i is the true label, and (mx_i + c) is the prediction, playing the same role as h_\theta(x_i). With this formula, the bias term is expressed as c, whilst in the first formula the bias term could be the first element of the parameter vector \theta if used. The footnote is telling us this term (mx_i + c) is \hat{y}_i. Squaring the resulting term takes care of any negative values.
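To make this concrete, and assuming I’m reading the two diagrams correctly (the first in the h_\theta form with a \frac{1}{2n} factor, the second in slope-intercept form; your slides may use slightly different symbols), the two formulas would be:

J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x_i) - y_i \right)^2

E = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - (mx_i + c) \right)^2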
So basically, the two formulae are doing the same thing: calculating the cost, i.e., the error.
If you go back to the lecture video on the cost function formula, the professor did mention that, and he said he would explain it later on. If you think of the cost as an indication of how well the model is finding values for the parameters W and b (weights and bias), then it is the downward trend that matters more than the pure value itself.
Adding to @Kic’s clear answer: using 1/n or 1/2n is equally valid, and it is a choice you can make. This term basically scales the cost to the number of samples.
The important thing is to be consistent across the entire model: whichever of 1/n or 1/2n you pick, stick with it throughout.
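To see what that means in practice, here is a minimal sketch in Python (the toy data and the compute_cost / half names are my own, not from the course):

```python
import numpy as np

def compute_cost(w, b, x, y, half=False):
    """Squared-error cost for a 1-D linear model y_hat = w*x + b.

    half=False gives the 1/n version, half=True the 1/(2n) version.
    """
    n = len(x)
    errors = (w * x + b) - y
    scale = 2 * n if half else n
    return np.sum(errors ** 2) / scale

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

# The two versions differ only by a constant factor of 2,
# so the same (w, b) minimizes both.
print(compute_cost(1.0, 0.0, x, y, half=False))  # ~4.667
print(compute_cost(1.0, 0.0, x, y, half=True))   # ~2.333, exactly half
print(compute_cost(2.0, 0.0, x, y, half=False))  # 0.0 at the true minimum
print(compute_cost(2.0, 0.0, x, y, half=True))   # 0.0 as well
```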
As Kic and Juan have explained, which constant factor you choose doesn’t really matter in terms of the final solution you get: if you minimize E, you have also minimized \frac{1}{2}E, and the other way around. The one other thing worth mentioning is why a lot of people prefer to add the factor of \frac{1}{2} here: the very next step is to take the derivative of E in order to compute the gradients for back propagation. Notice that the error terms are squared there, so taking the derivative will give you a factor of 2, right? If we have:
f(x) = x^2
Then
f'(x) = 2x
So the \frac{1}{2} will cancel that factor of 2 and just make the formulas for the gradients a bit simpler and cleaner. As mentioned above, it gives the same final answer either way, so why not optimize for the simpler gradient formulas? Those are what we actually use when writing the code.
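Spelling that out (using the w, b notation from earlier, and assuming the \frac{1}{2n} version of the cost), the chain rule gives:

\frac{\partial}{\partial w} \frac{1}{2n} \sum_{i=1}^{n} (w x_i + b - y_i)^2 = \frac{1}{n} \sum_{i=1}^{n} (w x_i + b - y_i) x_i

\frac{\partial}{\partial b} \frac{1}{2n} \sum_{i=1}^{n} (w x_i + b - y_i)^2 = \frac{1}{n} \sum_{i=1}^{n} (w x_i + b - y_i)

The 2 from the square cancels the \frac{1}{2}, leaving a clean \frac{1}{n} in front of each gradient.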