In this image, when we fit the original J(w,b) using the regularized term, shouldn't we be dividing by -1/m (the total number of examples) rather than just -1/m_train, since we are already calculating the training error? I'm a bit confused.
We’re averaging the cost over only the members of the training set.
Is there any specific reason to do that? If we averaged over all the examples in the dataset, what difference would that make?
Thanks.
It would give a false indication of the training error.
Hello, @Subhan75, to give you an intuitive example, let’s say 10 apples are distributed to market A and market B. Market A gets 4 of them and sells them at a total price of 20 dollars, and B gets the remaining 6 and sells them at 24 dollars.
Now, if we want to compare the prices of apples in these two markets, the reasonable formula would be
\frac{\text{total price of apples sold in market A}}{\text{total number of apples sold in market A}}
In this ratio, both the numerator and the denominator account only for apples in market A. The same goes for market B, and we get averaged prices of 5 dollars and 4 dollars in markets A and B respectively, concluding that B sells them cheaper.
The same idea applies here when we compare the averaged errors of dataset A (training) and dataset B (test):
\frac{\text{total error of samples in training set}}{\text{total number of samples in training set}}
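To make this concrete in code, here is a minimal NumPy sketch; the individual error values are invented so that the totals mirror the apple numbers above:

```python
import numpy as np

# Hypothetical per-example errors, chosen so the totals echo the analogy:
# "market A" = training set (total 20 over 4 examples),
# "market B" = test set (total 24 over 6 examples).
train_errors = np.array([4.0, 5.0, 6.0, 5.0])
test_errors = np.array([3.0, 4.0, 5.0, 4.0, 4.0, 4.0])

# Average each set over its OWN size, like price per apple per market.
avg_train = train_errors.sum() / len(train_errors)  # 20 / 4 = 5.0
avg_test = test_errors.sum() / len(test_errors)     # 24 / 6 = 4.0

# Dividing the training total by the FULL dataset size instead would
# understate the training error -- a "false indication":
wrong_avg = train_errors.sum() / (len(train_errors) + len(test_errors))  # 2.0

print(avg_train, avg_test, wrong_avg)
```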
Cheers
When we write the cost function as an average over the training set (i.e., the sum of the individual losses divided by m_train), we also want the regularization term on the same per-example scale, i.e. divided by the same m_train. Scaling both terms the same way keeps the balance between the data-fit term and the penalty stable, so a value of lambda chosen for one training-set size is more likely to keep working as the training set grows or shrinks.
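As a minimal sketch of that scaling in NumPy (assuming the squared-error cost for linear regression; the function name and signature here are illustrative, not from the course):

```python
import numpy as np

def compute_cost_regularized(X, y, w, b, lambda_=1.0):
    """Squared-error cost with L2 regularization.

    Both terms are divided by the SAME m (the training-set size),
    so the balance between fit and penalty does not drift as the
    number of training examples changes.
    """
    m = X.shape[0]                                 # m_train
    preds = X @ w + b                              # model predictions
    loss = np.sum((preds - y) ** 2) / (2 * m)      # averaged data-fit term
    reg = (lambda_ / (2 * m)) * np.sum(w ** 2)     # penalty on the same scale
    return loss + reg

# Tiny usage example with random data (only the shapes matter here):
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
w = rng.normal(size=3)
print(compute_cost_regularized(X, y, w, b=0.0, lambda_=1.0))
```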
Thanks, this makes sense.
You are welcome, Subhan75.