In the Skewed data sets section of Class 2, week 3, where precision/recall is discussed, I might have missed it, but I am curious which data is usually used to present this information: the training, cross-validation, or test data? It might be useful to look at this curve for the training and CV data while you are tuning the model's hyperparameters (alpha, lambda, et cetera), but it seems more appropriate to look at and show this plot using the predictions on the test data, after you have already tuned your model.
Question:
Is there a usual way this is done in the industry? Any thoughts are appreciated.
Thanks.
In general, the process is that you train on the training set, using the validation set results to adjust the training parameters, and then use the test set only for a final spot-check of the completed system's performance.
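To make that concrete, here is a minimal sketch in Python with scikit-learn. The dataset, model, and hyperparameter grid are illustrative assumptions of mine, not from the course; the point is only that the CV set drives the hyperparameter choice while the test set is touched once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# A skewed (imbalanced) binary classification problem as a stand-in dataset.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

# 60/20/20 split into training, cross-validation, and test sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_model, best_f1 = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:  # candidate regularization strengths
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    y_cv_pred = model.predict(X_cv)
    p = precision_score(y_cv, y_cv_pred)
    r = recall_score(y_cv, y_cv_pred)
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    if f1 > best_f1:  # pick the candidate using CV metrics only
        best_model, best_f1 = model, f1

# The test set is used once, as a final spot-check of the chosen model.
y_test_pred = best_model.predict(X_test)
print("test precision:", precision_score(y_test, y_test_pred))
print("test recall:   ", recall_score(y_test, y_test_pred))
```

Combining precision and recall into a single CV metric (F1 here) lets you rank candidates without ever looking at the test set during tuning.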
You may present precision/recall values on test data to anyone, including your company’s CEO.
You may present precision/recall values on cross-validation data to your fellow data science colleagues to discuss which candidate model is best.
You may present and compare precision/recall values on training and CV data to decide how to tune your hyperparameters. However, it might not be good practice to tune the probability threshold value with precision or recall, because of the arguments here.
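On the threshold point, the precision/recall trade-off is easy to see by sweeping thresholds on the CV set. The snippet below continues the sketch above (it reuses `best_model`, `X_cv`, and `y_cv` from that sketch, which are my assumed names, not course code) and just prints a few points along the curve.

```python
from sklearn.metrics import precision_recall_curve

# Probability scores for the positive class on the CV set.
cv_scores = best_model.predict_proba(X_cv)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_cv, cv_scores)

# Each threshold trades precision against recall, which is why maximizing
# either metric on its own tends to push the threshold to an extreme.
for p, r, t in list(zip(precisions, recalls, thresholds))[::20]:
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```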