The lecture suggests doing error analysis on the misclassified samples in the cv set. I was thinking error analysis should be done on samples from the training set, and then we can see if it improves the cv error as well. Is there a reason why doing error analysis on the cv set is preferred?
Hello @Rath!
Compared to the training set, the cv set is the one your model should perform well on, because the better it does there, the better your model is generalizing to unseen data. So when we evaluate how good a model is, we evaluate it on the cv set, and when the cv performance isn't good, we look at which cv samples aren't predicted well by the trained model.
It still seems like performing error analysis on training data should be preferred over doing it on cv data, because if we do error analysis on training data and the new model then improves the cv error, we can be more confident that the model is doing a better job on unseen data. Does that make sense? Or is there something wrong with my logic?
Hello @Rath,
I think the whole logic flow is this: first, we evaluate our model on the cv set (we don't evaluate on the training set). If the cv set doesn't perform well, we can only locate the cv samples that are predicted wrongly; we cannot guess which training samples contribute to the wrong predictions without first knowing which cv samples are predicted wrongly, agree? Then, based on your analysis of those wrongly predicted cv samples, we start to list out some action items. If those wrong cv samples share something that is completely unseen in the training samples, then you may need to adjust your training set, and in that case you can't find out what's missing from your training set without first looking at your cv samples.
Of course, there are some things you can do with your training samples without even testing your model, such as removing or correcting wrongly labelled training samples; however, error analysis motivates improvement by looking at the errors, which requires you to test your model on the cv set first. On the other hand, we can always improve our training data's labels at any time.
Model improvement requires us to train a new model, and everything related to the training process can be considered for improvement, including the model architecture, the hyperparameters, and the training dataset. Therefore, by examining the cv set we are not ignoring the training set; we just need to know first what kind of unseen data (cv data) our model is not good at.
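To make that workflow concrete, here is a minimal sketch (assuming a scikit-learn-style classifier; `X_train`, `y_train`, `X_cv`, `y_cv`, and the choice of `LogisticRegression` are just illustrative placeholders, not from the lecture): train on the training set, measure the error on the cv set, and pull out the misclassified cv samples for manual inspection.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def misclassified_cv_examples(X_train, y_train, X_cv, y_cv):
    # 1. Train only on the training set.
    model = LogisticRegression(max_iter=1000)  # illustrative model choice
    model.fit(X_train, y_train)

    # 2. Evaluate on the cv set (data the model has not seen).
    cv_pred = model.predict(X_cv)
    cv_error = np.mean(cv_pred != y_cv)
    print(f"cv error: {cv_error:.3f}")

    # 3. Locate the cv samples the model got wrong; these are the ones
    #    to inspect by hand during error analysis.
    wrong_idx = np.where(cv_pred != y_cv)[0]
    return X_cv[wrong_idx], y_cv[wrong_idx], cv_pred[wrong_idx]
```

Looking at what the returned misclassified cv samples have in common is what then suggests the action items, e.g. collecting more training data of that kind.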
If you sell your model to a customer after you have trained it well and are very confident about it, and one day the customer tells you it doesn't work right on their dataset, I believe you will want to look at that dataset, right? The customer's dataset is like the cv dataset, because it is unseen by your model.
Lastly, if we train our model and then, without testing it on the cv set, move on to analyzing our problematic training samples in the hope that the improvements from there will make the cv set perform better, I would say it's not impossible, but it's not efficient. Since we assess our model with the cv set anyway, why not guide ourselves with those problematic cv samples? One worst case is that we have a bunch of training samples the model doesn't do well on, but they represent data that's completely unneeded by your customer, and then we end up wasting our time.
Cheers!
That makes a lot more sense. Thanks for the explanation.