DLS C3W2 3rd Lecture notes

I am not able to understand how we can get this distribution of errors:

Human error - 4%
Training set error - 7%
Train-dev set error - 10%
Dev set error - 6%
Test set error - 6%

Specifically, I am having a hard time understanding the difference between the train-dev set and the dev set.

It’s probably worth watching the relevant lectures again. Prof Ng does explain all this in some detail. The point is that in cases in which the distribution of the training set and the dev/test set are different, you can gain some advantage by further subdividing the training set to provide a smaller “dev” set that matches the distribution of the training set. That is the set he calls the “train-dev set”. Since the model is never trained on it, the gap between the training error and the train-dev error measures variance, while the gap between the train-dev error and the dev error measures the data mismatch.
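
If it helps to see it in code, here is a minimal sketch of how the four splits relate, using scikit-learn’s train_test_split. The array names and the random stand-in data are hypothetical, just to show the mechanics:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins: X_big/y_big come from the plentiful training
# distribution (e.g. web images); X_target/y_target come from the
# distribution you actually care about (e.g. blurry phone images).
rng = np.random.default_rng(0)
X_big, y_big = rng.normal(size=(10000, 64)), rng.integers(0, 2, 10000)
X_target, y_target = rng.normal(size=(2000, 64)), rng.integers(0, 2, 2000)

# Carve a small "train-dev" set out of the training distribution.
# The model is never trained on it, but it has the SAME distribution
# as the training set, unlike the dev/test sets.
X_train, X_train_dev, y_train, y_train_dev = train_test_split(
    X_big, y_big, test_size=0.1, random_state=0)

# The dev and test sets both come from the target distribution.
X_dev, X_test, y_dev, y_test = train_test_split(
    X_target, y_target, test_size=0.5, random_state=0)
```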

There is no shame in having to watch portions of the lectures multiple times. I frequently stop the lecture and then start it again a minute earlier to watch a given section again if I don’t “get it” the first time through. Prof Ng does a great job of explaining all this material, but any of us may need to hear an explanation a couple of times to fully understand all the points he is making.

I also find that it really helps to “take notes” the old-fashioned way, with pencil and paper. Trying to actually write down a summary of the points Prof Ng has made forces me to “hear” everything, and frequently that requires playing the clip multiple times to catch all the subtleties.

Thank you, sir, for your advice, though I already do exactly that. What I meant by the last sentence was: how are the train-dev and dev sets different in this case? What scenario could have caused this distribution of errors?

That’s not quite what you said in the original post, but I think I get your point. Prof Ng did explain this in that section as he walked through a number of examples of how things can work out. The particular pattern of errors you see there can occur when the dev/test sets come from a different distribution than the training set, and it just happens that the dev/test data is easier for the algorithm to deal with than the training data. That pattern means you are underfitting the training and train-dev data, but the algorithm actually does better than that on the dev and test data. So you probably need a more complex model, because it is also doing worse on the training data than on the dev/test data. In other words, the gap between the training and train-dev errors (overfitting) is not the primary thing to attack in this case. First you have to do better on the training and train-dev data, which requires a higher-capacity model, i.e. trading some bias for variance, right? Then see what happens. You hope that lowering the training and train-dev errors will not hurt the performance of the algorithm on the dev/test data, but as always, you won’t know until you run the experiment.
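
To make the diagnosis concrete, here is a tiny worked sketch that plugs in the error numbers from the original post. The gap names follow the lecture’s terminology; the comments are my reading of the result:

```python
# Error rates (%) from the original post.
human_error     = 4
train_error     = 7
train_dev_error = 10
dev_error       = 6

avoidable_bias = train_error - human_error      # 3: underfitting the train set
variance       = train_dev_error - train_error  # 3: some variance as well
data_mismatch  = dev_error - train_dev_error    # -4: dev data is "easier"

print(f"avoidable bias: {avoidable_bias}%")
print(f"variance:       {variance}%")
print(f"data mismatch:  {data_mismatch}% (negative => dev/test easier)")
```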

I think I got most of it. I understand that we have to reduce the bias by allowing the model more variance.

  1. But why opt for a more complex model?
    The training data already contains complex training examples; that is why the model does worse on the train set than on the dev/test sets. By underfitting the train set, we ended up with a lower dev-set error. If we are sure that we will be working on the dev/test distribution, do we really need a more complex model? (Put another way: we already have what we needed, good accuracy on the dev/test set, so why train the model further and reduce the train-set error, which could increase the dev/test error?)

  2. Does this mean that, to do better, we must add more data similar to the dev-set distribution to the train set?

  3. Also, throughout the course, I have heard the term “easier data”. But what do we mean by easier data? If I have 2 datasets (say of cat images), how can I tell which one is easier?

The point is that the model does not do very well on the training data. It has too high a “bias”. So how do you remedy that? By modifying the model to have more “variance”. So how do you accomplish that? By adding more complexity to the model, right?
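
As a rough illustration (Keras, with hypothetical layer sizes, not anything prescribed by the course), “adding complexity” can be as simple as making the network wider or deeper so it has enough capacity to fit the training data:

```python
import tensorflow as tf

def build_model(hidden_units, input_dim=64):
    """Same model family; more/wider hidden layers mean more capacity."""
    model = tf.keras.Sequential()
    for units in hidden_units:
        model.add(tf.keras.layers.Dense(units, activation="relu"))
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))
    model.build(input_shape=(None, input_dim))
    return model

simple_model  = build_model([16])            # higher bias, lower variance
complex_model = build_model([128, 128, 64])  # lower bias, higher variance
```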

For question 3), you can tell that the data is “easier” by the performance of the model. In this case it does better on the dev/test data than on the training data. So by definition that means the dev/test data is “easier”. “Easier” is not defined by looking at the data or by your personal “feelings” about it: it is defined by the performance of your model on the data.
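
As a sketch of that operational definition (synthetic stand-in data and a hypothetical make_set helper, not real cat images), “easier” is simply whichever dataset the trained model scores better on:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_set(n, noise):
    """Hypothetical stand-in for a dataset; larger `noise` makes it harder."""
    X = rng.normal(size=(n, 20))
    y = (X[:, 0] > 0).astype(int)  # labels depend on the clean features
    return X + rng.normal(scale=noise, size=X.shape), y

X_train, y_train = make_set(2000, noise=1.0)
X_a, y_a = make_set(500, noise=0.2)  # dataset A: cleaner examples
X_b, y_b = make_set(500, noise=2.0)  # dataset B: noisier examples

model = LogisticRegression().fit(X_train, y_train)

# "Easier" is defined by measured error, not by looking at the data.
err_a, err_b = 1 - model.score(X_a, y_a), 1 - model.score(X_b, y_b)
print(f"error on A: {err_a:.2%}, error on B: {err_b:.2%}")
print("A is easier" if err_a < err_b else "B is easier")
```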