Contradiction in Specialisation Materials

In Course 2, Week 1, Andrew explains how to split the dataset for training and evaluation, and he warns about mismatched training and test set distributions.

  • training dist != test dist (not okay; the sets must be i.i.d.)

The example he gives involves the cat classifier and mismatched distributions: the training set is built from internet images, while the test set comes from user uploads to the app. The answer is to pool them together, randomly shuffle, and then split; now the distributions across sets are i.i.d. (independent and identically distributed).
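A minimal sketch of that pool-shuffle-split fix, assuming the two sources are just lists of image identifiers (the function name and counts are illustrative, not from the course):

```python
import random

def pooled_split(internet_imgs, app_imgs, test_frac=0.2, seed=0):
    """Pool both sources, shuffle, then split, so that the training
    and test sets are drawn from the same mixed distribution."""
    pooled = list(internet_imgs) + list(app_imgs)
    random.Random(seed).shuffle(pooled)
    n_test = int(len(pooled) * test_frac)
    return pooled[n_test:], pooled[:n_test]  # train, test

train, test = pooled_split([f"web_{i}" for i in range(800)],
                           [f"app_{i}" for i in range(200)])
print(len(train), len(test))  # 800 200
```

After the shuffle, internet and app images appear in both sets in roughly the same proportions, which is exactly what makes the two sets identically distributed.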

This makes sense: the network is learning the data distribution, so for the learned distribution to be useful, it should match the distribution we test the classifier on (were we successful at learning the distribution?). If one didn't randomly shuffle and instead used the internet images for training and the app uploads for testing, the classifier wouldn't perform well, because one would be testing performance on data from a distribution the classifier has not been trained on. This is true; I've tried it out. Maximum Likelihood principles also speak to this point.

The most important point Andrew stresses is that, whatever one does, the eval sets (dev and test) must come from the same distribution; the test set now becomes not a single set but two eval sets, dev and test. So, given the above logical constraint, for the i.i.d. assumptions to hold we need:

  • training dist == dev dist == test dist (i.i.d)

But then he says it is sometimes okay to have just a dev set with a different distribution from the training set, and no test set at all (although he personally advises against this). So:

  • dev dist != training dist (this is now okay, but it contradicts the i.i.d. assumption)

  • training dist != dev dist == test dist could also hold, AND THEN training dist != test dist (a logical contradiction with the first statement).

In Course 3, Week 1, the feedback from the assessment clearly gives the following logic:

  • training dist != dev dist (this is okay; we can tune the hyperparameters using the dev set, setting up the target for our test set predictions). The example is that citizens contribute their own images: social media images (1,000,000) and images taken with their own cameras (1,000). In this first question, the grader says adding the former to the training set can improve performance on the eval sets, but then the logic is:

  • training dist != dev dist, AND therefore NOT training dist == dev dist == test dist (not i.i.d.).

  • dev dist == test dist (this is okay, because we've tuned the classifier to predict on the test distribution, i.e. dev and test are i.i.d.; these were taken from the original 10,000,000 images provided by the project team, with m_dev = 250,000 and m_test = 250,000). Given these logical constraints, the following is then true (although not stated):

  • training dist != test dist (and this is now okay. The problem is that this runs contrary to the logic about mismatched training and test distributions that Andrew gave earlier; see the first logical statement).
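For concreteness, here is a sketch of the Course 3 style split as I understand it, with illustrative counts and made-up names: all the easy-to-get social-media images go into training, while dev and test are carved only out of the target-distribution (camera) images, so dev and test match each other even though training does not match them:

```python
import random

def target_dist_split(social_imgs, camera_imgs, n_dev, n_test, seed=0):
    """Dev and test come only from the target (camera) distribution;
    training gets all social-media images plus the leftover camera images."""
    cams = list(camera_imgs)
    random.Random(seed).shuffle(cams)
    dev = cams[:n_dev]
    test = cams[n_dev:n_dev + n_test]
    train = list(social_imgs) + cams[n_dev + n_test:]
    return train, dev, test

train, dev, test = target_dist_split(
    [f"social_{i}" for i in range(10000)],  # stand-in for the 1,000,000
    [f"camera_{i}" for i in range(1000)],   # stand-in for the camera images
    n_dev=250, n_test=250)
print(len(train), len(dev), len(test))  # 10500 250 250
```

This is exactly the arrangement where training dist != dev dist == test dist: the eval sets are i.i.d. with each other, but deliberately not with the training set.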

Please can you clarify? All the textbooks say the split must ensure that the training and eval sets are i.i.d. I can imagine that in certain circumstances this may work, if the distributions are close (in terms of cross-entropy measures), but not if they are entirely different:

  • training dist ~= eval dist (but only when the cross-entropy between the training dist and the eval dist is small)
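As a rough illustration of "close in cross-entropy terms", one could compare empirical histograms (of labels, say) with KL divergence; this is my own sketch, not something from the course, and the example distributions are made up:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability
    lists over the same support; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

train_dist = [0.5, 0.3, 0.2]
close_eval = [0.48, 0.32, 0.20]   # similar distribution: small divergence
far_eval   = [0.05, 0.05, 0.90]   # very different: large divergence
print(kl_divergence(train_dist, close_eval) <
      kl_divergence(train_dist, far_eval))  # True
```

(Cross-entropy H(p, q) is just H(p) + KL(p || q), so for a fixed training distribution p, comparing cross-entropies and comparing KL divergences rank eval distributions the same way.)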

Am I missing something?


Hi Matt,

This is an interesting question! The course didn't touch on this aspect directly, but yes, there was a lot of talk about cross-validation, and we can generalise and derive our assumptions after going through the materials we were provided. A little web search can clarify our understanding of the subject even more. Here are a couple of links below; you could delve in and read further about the distribution of the complete data that the algorithm can work on. If you have any doubts, we can all have a nice discussion on this platform, as it's really an interesting topic :)

  1. Machine Learning - Lecture 5: Cross-validation (this will give you an idea of the i.i.d. assumption)

  2. Significance of I.I.D in Machine Learning | by Sundaresh Chandran | DataDrivenInvestor (this link explains the significance of i.i.d. in machine learning)

Happy Learning!