In linear regression, or more generally where applicable, why is it important that the cost function doesn't grow in proportion to the training set size (i.e., why use the average squared error rather than the total squared error)?
It's useful to have a normalized cost value so that model fit can be compared across training sets of different sizes. The total squared error grows roughly linearly with the number of examples $m$, so a cost of 50 on 100 examples and a cost of 500 on 1,000 examples may describe exactly the same quality of fit. Dividing by $m$, as in $J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2$, turns the cost into an estimate of the expected per-example error, which is directly comparable across sizes.
This matters in particular for a "learning curve": a plot of training and validation error as a function of training set size, used to assess whether the training set is large enough to give good performance.
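For illustration, here's a minimal NumPy sketch (the synthetic problem, noise level, and training-set sizes are all made up for the example). It fits ordinary least squares at several training-set sizes and prints both the total and the average squared error: only the average stays on a comparable scale as $m$ grows, which is what a learning curve needs.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(m):
    """Hypothetical 1-D problem: y = 3x + Gaussian noise (sigma = 0.5)."""
    x = rng.uniform(-1.0, 1.0, size=m)
    y = 3.0 * x + rng.normal(scale=0.5, size=m)
    X = np.column_stack([np.ones(m), x])  # prepend an intercept column
    return X, y

X_val, y_val = make_data(2000)  # fixed held-out set for the validation curve

print(f"{'m':>6} {'train SSE':>10} {'train MSE':>10} {'val MSE':>8}")
for m in [10, 50, 100, 500, 1000]:
    X_tr, y_tr = make_data(m)
    theta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)  # least-squares fit
    residuals = X_tr @ theta - y_tr
    sse = np.sum(residuals ** 2)  # total squared error: grows roughly linearly with m
    mse = sse / m                 # average squared error: comparable across m
    val_mse = np.mean((X_val @ theta - y_val) ** 2)
    print(f"{m:>6} {sse:>10.2f} {mse:>10.4f} {val_mse:>8.4f}")
```

Running this, the SSE column climbs by roughly a factor of 100 from $m=10$ to $m=1000$ even though the fit is getting better, while both MSE columns hover near the noise variance ($\approx 0.25$), so the training and validation curves can be read off directly.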