The lecture mentioned that you can measure bias by comparing your model's error to human-level performance, a competing algorithm's performance, or a guess based on experience. How would I apply this advice to a Kaggle competition? I have built a machine learning model and want to test it for bias. How would I do that?
Hey @Alexander_Leon,
In a Kaggle competition you are given a dataset, so you can use the strategy from the lecture video “Model selection and training/cross validation/test sets”: split the dataset into training, cross-validation and test sets. Then compare the performance on the training and cross-validation sets to find out whether your model has high bias, high variance or both, as discussed in the lecture video “Diagnosing bias and variance”. Let me know if this helps.
Cheers,
Elemento
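The split-and-compare step described above can be sketched roughly as follows. This is only a minimal illustration in plain Python, not the course's code: the 60/20/20 proportions, the helper names `split_dataset` and `error_rate`, and the toy data are all assumptions made for the example.

```python
import random


def split_dataset(examples, train_frac=0.6, cv_frac=0.2, seed=0):
    """Shuffle and split examples into training, cross-validation and test sets.

    The 60/20/20 default is an assumed, commonly used proportion.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * train_frac)
    n_cv = int(len(shuffled) * cv_frac)
    train = shuffled[:n_train]
    cv = shuffled[n_train:n_train + n_cv]
    test = shuffled[n_train + n_cv:]  # whatever remains
    return train, cv, test


def error_rate(predict, examples):
    """Fraction of (features, label) pairs that `predict` classifies wrongly."""
    wrong = sum(1 for x, y in examples if predict(x) != y)
    return wrong / len(examples)


# Hypothetical toy data: the feature is an integer, the label is its parity.
data = [(i, i % 2) for i in range(100)]
train, cv, test = split_dataset(data)
```

With a real model, `predict` would be the model's prediction function, and you would compare `error_rate(predict, train)` against `error_rate(predict, cv)`.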
@Elemento I’m thinking about the “Establishing a baseline level of performance” video. Using that video as reference, if I split my Kaggle dataset into three and computed that the training error was 10.8% and my cross-validation error was 14.8%, then does this demonstrate high bias or high variance?
Hey @Alexander_Leon,
My bad. So, essentially, your question is “How do I determine bias when no human-level (or baseline) performance is available?”. There have been many discussions on this in the past; let me link a few of them:
- Bayes error, human-level performance and overfitting (structured data)
- Human Level Performance, how to set it?
- What to do when there is no human-level performance baseline?
- How to measure human level performance against human made labels?
- Approach when human-level performance not available
You will find that these posts share a great deal of knowledge regarding your query. Do check these out.
Now, when it comes to a Kaggle competition, there is a simple hack: check the top scorer on the leaderboard. Let’s say they achieve a 1% error. We can take this as the baseline performance (since we want to beat the top scorer). In that case, your training error of 10.8% is 9.8 percentage points above the baseline, which indicates high bias, and your cross-validation error of 14.8% is another 4 points above the training error, which indicates considerable variance as well. And off you go! Let me know if this helps.
Cheers,
Elemento
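The comparison against a leaderboard baseline can be written out as a small check. This is a rough sketch only: the function name `diagnose` and the 2-percentage-point `threshold` are assumptions for illustration, since the thread does not say how large a gap counts as "high".

```python
def diagnose(baseline_err, train_err, cv_err, threshold=0.02):
    """Compare error gaps against an assumed threshold.

    bias_gap:     how far the training error is above the baseline
    variance_gap: how far the cross-validation error is above the training error
    """
    bias_gap = train_err - baseline_err
    variance_gap = cv_err - train_err
    return {
        "bias_gap": bias_gap,
        "variance_gap": variance_gap,
        "high_bias": bias_gap > threshold,
        "high_variance": variance_gap > threshold,
    }


# Numbers from the thread: 1% leaderboard baseline, 10.8% training, 14.8% CV.
result = diagnose(baseline_err=0.01, train_err=0.108, cv_err=0.148)
print(result["high_bias"], result["high_variance"])
```

Here the bias gap is about 9.8 percentage points and the variance gap about 4, so both flags come out true under the assumed threshold. The same training and cross-validation errors would lead to a different diagnosis if the baseline were much closer to the training error.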