So I am playing around with a data set of my own to test my knowledge from week 3 of the Machine Learning Specialization.
I am currently plotting my model's performance against different amounts of training data. The idea is inspired by the following graph from the optional lab in the “bias and variance” section.
I have a function that pulls data from my database according to how many samples I want. My problem is that every once in a while, as you can see in the following image, the model gets a terrible R² score (high error) on both the training set and the validation set.
Here is the example written out: why does the model fail at 5000 training samples when it works as expected at 4000 and 6000? The model has around 60 input features.
(The written example is not from the same run as the following image.)
4000 training samples:
Training R²: 0.28
Testing R²: 0.14
5000 training samples:
Training R²: -0.04
Testing R²: -0.05
6000 training samples:
Training R²: 0.23
Testing R²: 0.18
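For context, here is roughly how I generate these numbers. This is only a minimal, self-contained sketch: fetch_samples() stands in for my actual database query (it generates synthetic data here so the snippet runs on its own), and LinearRegression is just a placeholder for my real model with ~60 input features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

def fetch_samples(n_samples, n_features=60):
    """Stand-in for my database query: returns synthetic data so the
    sketch runs on its own. My real function reads rows from the database."""
    X = rng.normal(size=(n_samples, n_features))
    true_w = rng.normal(size=n_features)
    y = X @ true_w + rng.normal(scale=5.0, size=n_samples)
    return X, y

def r2_for_size(n_train, n_test=1000):
    """Train on n_train samples, validate on a separate n_test samples,
    and return (training R², testing R²)."""
    X, y = fetch_samples(n_train + n_test)
    X_train, y_train = X[:n_train], y[:n_train]
    X_test, y_test = X[n_train:], y[n_train:]
    model = LinearRegression().fit(X_train, y_train)  # placeholder for my actual model
    return (r2_score(y_train, model.predict(X_train)),
            r2_score(y_test, model.predict(X_test)))

for n in (4000, 5000, 6000):
    train_r2, test_r2 = r2_for_size(n)
    print(f"{n} training samples: Training R² = {train_r2:.2f}, Testing R² = {test_r2:.2f}")
```

In the real run I sweep 30 different training-set sizes and plot training and testing R² against the number of training samples.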
EDIT:
Running it again, I only get one such error across my 30 different training-set sizes:
I believe that looks like a pretty healthy model except for the error at 5000 training samples. Feel free to share other thoughts about my model! Do you agree that getting more data would not help in this case, and that I should probably try a more complex model instead?
Appreciate any feedback!