Interesting predictions…
Interesting results! Thanks for showing us what happens with your model! It is especially cool that you showed us the actual 64 x 64 images that the algorithm is really “seeing”. BTW I assume this is the Logistic Regression model in Week 2. You can try again with the 4 layer model from Week 4 when you get there.
Even in that case, we don’t really get very good generalizable performance. My guess is that the training set here is tiny compared to the real complexity of this task. 209 training images is really a small dataset, so it’s actually pretty surprising that it works as well as it does. I think the training and test data are pretty carefully “curated”. There was a big thread in the old version of these courses where the dataset was examined. I’ll see if I can bring that information forward to Discourse, but one of the observations is that the training set is skewed towards non-cats and the test set is skewed towards cats. We tried “rebalancing” things a bit and everything we tried made the results worse, which was the basis for my supposition about the training data being very carefully curated here.
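For anyone curious what such a rebalancing experiment could even look like: the sketch below is just one simple possibility, not the exact procedure from the old thread. It subsamples the non-cat majority in the training set down to a 50/50 split, and it assumes train_set_x is the flattened (n_x, m) data matrix and train_set_y the (1, m) label vector from the assignment.

import numpy as np

# Hypothetical rebalancing sketch: equalize cats and non-cats in the training set
# by subsampling the non-cat majority. Assumes train_set_x has shape (n_x, m) and
# train_set_y has shape (1, m), as in the assignment notebook.
cat_idx = np.where(train_set_y[0] == 1)[0]
noncat_idx = np.where(train_set_y[0] == 0)[0]

rng = np.random.default_rng(0)
keep_noncat = rng.choice(noncat_idx, size=len(cat_idx), replace=False)
keep = np.concatenate([cat_idx, keep_noncat])
rng.shuffle(keep)

balanced_train_x = train_set_x[:, keep]
balanced_train_y = train_set_y[:, keep]
print(f"mean(balanced_train_y) = {np.mean(balanced_train_y)}")  # 0.5 by construction

Of course this also throws away training data, which is one plausible reason that this kind of rebalancing can make results worse on a dataset this small.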
What do you mean by “skewed” towards cats/non-cats? That there are more non-cats in the training set and more cats in the test set? I’d be interested in seeing that thread, especially if it also contains a discussion of why and how this asymmetry affects the performance of the learning algorithm. Ideally, do we want equal numbers of positive and negative examples, or do we want them to be just more than enough individually (given the model’s complexity) and IID representatives of their respective classes?
Yes, it’s easy to assess how many cats there are in each dataset from the labels:
print(f"mean(train_set_y) = {np.mean(train_set_y)}")
print(f"mean(test_set_y) = {np.mean(test_set_y)}")
mean(train_set_y) = 0.3444976076555024
mean(test_set_y) = 0.66
So you can see that the training set is 34% cats and 66% non-cats, whereas the test set is 66% cats and 34% non-cats. There is an interesting reverse symmetry there, but I do not know if that is significant. My guess is that there is no general mathematical requirement about the balance and that everything we see here is very “ad hoc” and specific to the particular case at hand. The training set has 209 entries and the test set has 50, and these are (as I commented above) extremely small numbers for a problem this complex.
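To make those percentages concrete, here is a quick sketch that turns the quoted sizes and label means into raw counts and shows what a trivial “always predict non-cat” baseline would score on each set. This is pure arithmetic from the numbers above, nothing from the actual model:

# Counts implied by the sizes and label means quoted above
m_train, m_test = 209, 50
cats_train = round(0.3444976076555024 * m_train)   # 72 cats, 137 non-cats in training
cats_test = round(0.66 * m_test)                   # 33 cats, 17 non-cats in test

# Accuracy of a classifier that always predicts "non-cat"
print(f"train: {(m_train - cats_train) / m_train:.1%}")   # 65.6%
print(f"test:  {(m_test - cats_test) / m_test:.1%}")      # 34.0%

So a model that leans towards the majority class it saw in training gets penalized on a test set where the majority flips, which is one concrete way this particular skew could matter.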
Your questions are really important ones, but I don’t know whether there is any general answer about the balance of positive and negative samples. If I had to place a bet, I would bet that there is no general rule and it’s all situational. I will try to find time to bring over the threads from the previous forum, but I can’t guarantee that I can get to that today. When I do, I’ll probably create a new thread and give a link to it on this thread.
Hi, Tolga.
I’m sorry that it took me so long to get to it, but I finally created the thread I was hoping to create: it has some deeper analysis of the balance of the dataset here and runs some experiments with that.