How can we use the Dev set / cross validation set to eliminate bias?
Hi Max,
I am assuming you are referring to the bias/variance trade-off. Dev sets and cross-validation techniques are used to control overfitting. When a model is overfit, it exhibits high variance relative to bias. The normal training process is enough to reduce bias, and cross-validation ensures we don’t reduce bias too far at the cost of variance; there’s always a trade-off.
Deep Learning / ML models are function approximators: they try to approximate the underlying function that generates the data. Since the result is only an approximation, there is a trade-off. The only way to eliminate bias without incurring a penalty on variance is to figure out what the underlying function actually is.
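To make the overfitting-detection part concrete, here’s a minimal sketch (assuming scikit-learn; the model and dataset are made up purely for illustration) of how cross-validated scores expose variance that training accuracy hides:

```python
# Minimal sketch: compare training accuracy to cross-validated accuracy
# to detect overfitting. The dataset is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# An unconstrained tree can memorize the training set (high variance).
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)
train_acc = model.score(X, y)

# 5-fold cross-validation scores the model on held-out folds only.
cv_acc = cross_val_score(model, X, y, cv=5).mean()

print(f"train accuracy: {train_acc:.3f}")  # typically ~1.0
print(f"cv accuracy:    {cv_acc:.3f}")     # noticeably lower => overfitting
```

The gap between the two numbers is the signal: training accuracy alone would tell you nothing about variance.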
Thanks for the response. So, if our model does not have any hyperparameters, do we still want to use a dev set to control overfitting? How exactly does a dev set and/or cross-validation control overfitting?
Maybe I’m missing your point, but there is no such thing as a DL model with no hyperparameters, right? No matter what, you have to at least choose the number of layers and the number of neurons in each layer if it’s a FC network. And the hidden layer activations. And the optimization algorithm (GD, SGD, minibatch, Adam …).
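Just to make that list concrete, here’s a minimal sketch (assuming Keras; every specific number and name below is an arbitrary choice, which is exactly the point):

```python
# Every one of these arguments is a hyperparameter choice:
# number of layers, units per layer, hidden activations, optimizer.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),                      # input dimension
    tf.keras.layers.Dense(64, activation="relu"),     # hidden layer 1
    tf.keras.layers.Dense(32, activation="relu"),     # hidden layer 2
    tf.keras.layers.Dense(1, activation="sigmoid"),   # output layer
])

# The optimizer (Adam vs SGD vs minibatch GD) is yet another hyperparameter.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```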
I think in your original question and this one it may just be a question of wording, but you don’t use the dev set to “control overfitting” or “eliminate bias”. You use the dev set to detect overfitting or recognize excess bias. The “control” or “elimination” comes from what you then change about your model to remedy the problem that the dev set reveals.
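As a sketch of the kind of diagnostic the dev set enables (the error thresholds below are made-up illustrative numbers; in practice you’d compare against human-level or Bayes error):

```python
# Sketch of the diagnostic a dev set makes possible: compare training
# error to dev error. The 5% thresholds are illustrative, not universal.
def diagnose(train_err: float, dev_err: float, target_err: float = 0.0) -> str:
    if train_err - target_err > 0.05:
        return "high bias: try a bigger network, train longer, or a new architecture"
    if dev_err - train_err > 0.05:
        return "high variance: try more data, regularization, or a new architecture"
    return "looks reasonable: move on to the test set"

print(diagnose(train_err=0.01, dev_err=0.11))  # -> high variance (overfitting)
print(diagnose(train_err=0.15, dev_err=0.16))  # -> high bias (underfitting)
```

The dev set is what gives you the second number; the “control” is whichever remedy you then apply.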
What I meant to say was: what if we want to keep our hyperparameters fixed? I’m not totally sure that scenario makes sense, but I was trying to figure out what the other reason for using the dev set would be besides selecting the best model. So you answered my question: it’s to detect our model’s problems so we can try to improve it before moving on to the test set. A follow-up question I have is: why can’t we just use the test set to improve our model? Technically the dev set is “unseen” data every time we make a new model with different hyperparameters, so when we choose the best model we’re choosing the one that performed best on unseen data. Why would we need to repeat that again on a test set (the dev set could just be the test set)?
The point is that once you finish the refinement cycle using the training vs dev set, then you need an unbiased evaluation of how your model does on data that it’s never “seen” before. If you only have the training and test sets to use, then how do you get an honest evaluation of the final performance of your tuned model? It was tuned on the “test set” in your scenario. There needs to be a separate set of data that is not used in the training process, right? You can change the names, but that doesn’t change the procedure. E.g. you could call the dev set the “test set” and then have another set called “real world data” or call it the “Fred and Barney set”, but the names don’t matter.
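To make the procedure concrete, here’s a minimal sketch of a three-way split (assuming scikit-learn’s train_test_split; the 80/10/10 proportions are just one common choice):

```python
# Sketch of a three-way split: the names don't matter, the roles do.
# train -> fit parameters; dev -> tune/compare models; test -> one final,
# unbiased evaluation. Split sizes (80/10/10) are an illustrative choice.
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)

X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_dev), len(X_test))  # 800 100 100
```

The test fold is never touched during the tuning loop; that’s the only thing that keeps the final number honest.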
And saying “if we want to keep our hyperparameters fixed” is a pretty limiting scenario in general, right? You’d be asking how to change your model’s performance without changing your model in any way. The only scenario I can think of for that would be if you have overfitting and decide to try adding more training data. That is one of the alternatives that Prof Ng lists for the “overfitting” case, but he points out that it is not always easy or cheap to do.
I think this might have been what I was trying to get at with my original question. Using a dev set “eliminates” bias [from our evaluation] because if we only used a test set, our evaluation would be biased.
To your second response: maybe the scenario I was thinking of would be just training a linear regression model without deep learning. But that would still have hyperparameters (optimization algorithm, etc.).
In your first paragraph, I think you are using “bias” in the general sense that it is used in English as a natural language, not the technical definition of “bias” as the opposite of “variance” in Prof Ng’s nomenclature. Maybe that’s ok …
Linear Regression is a pretty limited case and it’s not clear there is anything we can learn for our more complex purposes here from that example. Literally the only thing you can change is the data. There is a closed form optimal solution given by the solution of the Normal Equation, right? Of course one way you could change your data is by doing polynomial expansion on it, which opens up a space of hyperparameter choices.
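For reference, here’s a minimal numpy sketch of that closed-form solution, θ = (XᵀX)⁻¹Xᵀy (the data is synthetic):

```python
# Closed-form linear regression via the Normal Equation:
#   theta = (X^T X)^{-1} X^T y
# np.linalg.solve is used instead of an explicit inverse for stability.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = np.hstack([np.ones((100, 1)), X])      # prepend a bias column
true_theta = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_theta + rng.normal(scale=0.1, size=100)

theta = np.linalg.solve(X.T @ X, X.T @ y)  # no iterative optimizer needed
print(theta)                               # close to true_theta
```

No learning rate, no epochs, no optimizer choice: the solution drops out in one step, which is why plain linear regression leaves you so little to tune.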
You’re right, I was definitely using it in the general sense, which might have caused some earlier confusion. And I forgot that linear regression has a closed-form solution, so we technically don’t need any optimization algorithms. Thanks for your help.