I didn’t understand the sentence in the explanation of why comparing J(theta) on the test set, for models with varying degrees of polynomial, is a flawed approach. The explanation given was: ‘Certain degree polynomial being overly optimistic of the generalization error, this approach is flawed’.

Any help breaking this down in simpler English is much appreciated.
Thanks

This is called overfitting. It means the algorithm is really good at predicting data it has already seen and been given labels for, but it will categorize new data incorrectly because it followed the training data too strictly. The model has learned patterns that may not apply to future data.

So by reducing the complexity you might make some mistakes during training, but you’ll end up being more accurate with data you haven’t seen yet.

I’m just making up numbers, but this would be like training the model to be 100% accurate during training when it’s only 50% accurate on new predictions. Using a lower-degree polynomial, you might only achieve 95% accuracy during training, but when you throw new data at it you’re correct 89% of the time. The second outcome is more desirable.
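To make this concrete, here is a minimal sketch, assuming NumPy is available. The quadratic "ground truth", the noise level, and the sample sizes are all invented for illustration; they are not from the course. It fits polynomials of degree 1, 2, and 9 to a small training set and compares the error on the training data with the error on fresh data.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical ground truth: a quadratic plus Gaussian noise.
    x = rng.uniform(-1.0, 1.0, n)
    y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, 0.3, n)
    return x, y

def mse(coeffs, x, y):
    # Mean squared error of the fitted polynomial on (x, y).
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_train, y_train = make_data(15)    # small training set
x_new, y_new = make_data(1000)      # data the model has never seen

results = {}
for d in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, d)   # least-squares fit of degree d
    results[d] = (mse(coeffs, x_train, y_train), mse(coeffs, x_new, y_new))
    print(f"d={d}: train MSE={results[d][0]:.3f}, new-data MSE={results[d][1]:.3f}")
```

The training error can only go down as d increases, because a higher-degree polynomial can always reproduce a lower-degree one. The error on new data has no such guarantee, and with this few training points the degree-9 fit will typically do worse on new data than the degree-2 fit.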

Here is a visual example of overfitting: you can see that the overfitting model on the right is too complex and oscillates too much in certain areas. These are the areas where the model does not generalise well, which might cause big issues:

Thanks
High Bias, High variance very nicely represented!
In Week 3 of MLC2, Prof. Andrew's progression in explaining model selection was:

1. Dividing the data into training and test sets. Then he explains why this might be flawed, using the phrase "overly optimistic" about the generalization error. THIS is the point I couldn’t wrap my head around. No concept of a validation set had been introduced yet.

2. Going further from 1 above, he proposes training, validation/dev, and test sets. This I understood clearly: the selection criterion is the ‘minimal Jvalidation’ among the candidate models evaluated on the dev set, and the chosen model is then evaluated on the test set.

I found this answer on Stack Exchange (screenshot attached), which completed the answer for me. It may help anyone with the same question in the future.
Best

Yes. He is normally very clear but this explanation was not one of his best. I think the point is that since you are using the test set to decide on the optimal value of d, by definition/construction the result will be small for the test set and not indicative of performance for other data.

Some other sample of the data, such as the test set, will be very unlikely to have a lower value of J. So the value of J from the “training set” is generally going to be smaller (“more optimistic” in his words) than the test set. And that is the flaw. The value of J from the training set will not be indicative of the values for some other set of data. By extension, the value of d may also not be the best. As a result, the selected model may have a lower predictive value when you go to actually use your model.
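The point about choosing d on the test set can be illustrated with a small simulation. This is only a sketch under strong simplifying assumptions that are mine, not the lecture's: every candidate model has the same true error, and each test-set cost is that true error plus independent Gaussian noise (the constants below are made up). Even then, the lowest test cost among the candidates comes out below the true error on average, because we pick whichever candidate got lucky on that particular test set.

```python
import random
import statistics

random.seed(0)

TRUE_ERROR = 1.0     # assumed: all candidates generalize equally well
NOISE_SD = 0.2       # assumed: spread of a finite-sample cost estimate
N_CANDIDATES = 10    # e.g. d = 1 .. 10
N_TRIALS = 2000

def noisy_estimate():
    # Cost measured on a finite dataset = true error + sampling noise.
    return random.gauss(TRUE_ERROR, NOISE_SD)

selected_scores = []   # J_test of the winner, on the set used to pick it
fresh_scores = []      # cost of a model on data NOT used for the choice

for _ in range(N_TRIALS):
    test_scores = [noisy_estimate() for _ in range(N_CANDIDATES)]
    selected_scores.append(min(test_scores))
    fresh_scores.append(noisy_estimate())

mean_selected = statistics.mean(selected_scores)
mean_fresh = statistics.mean(fresh_scores)
print(f"mean J_test of the selected model: {mean_selected:.3f}")
print(f"mean cost on fresh data:           {mean_fresh:.3f}")
```

The first number is noticeably lower than the second, which is what "overly optimistic" means here. The model never trained on the test set, but the choice of d did use it.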

Hi all,
I have the exact same question as @tennis_geek, but I still don’t get it even after reading the Stack Exchange screenshot.

I understand the overfitting/underfitting concept, and I also understand that Jtrain(w,b) would be more optimistic than the generalization error, as the model was only trained on the training set.

However, the test set is separate from the training set. The model did not “see” the test set. In this case, why is Jtest(w5,b5) still more optimistic than the generalization error?
What did I miss here?
Thank you all in advance.

Hi @Lanying_Ma
The simplest way I absorbed the concept of train, cv/dev, and test sets is as follows:
The model we create is fit to the training set. If the predictions are very good, we shouldn’t relax, as this is only the one dataset we trained our model on.

So, we take another, different dataset (CV), tweak the model in any one or a combination of its properties (hidden nodes, hidden layers, regularisation parameter, etc.), re-fit each tweaked model on the training set, and evaluate it on the CV set. (Why we should tweak the model is a whole different answer, but I am keeping it minimal in this context.)
Then select the best model, the one which produces the lowest Jcv (cost function on the cv dataset).

Finally, the test set is the dataset on which we put the selected model through a real examination for the very first time, to see how it fares. We no longer fit or tune our model on the test set; we only assess the model's performance by the cost function computed on it.

So, in summary: fit the candidate models on the train set, choose the best one by Jcv on the cv set, then estimate performance on the test set (don’t fit on the test set).
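A minimal sketch of that workflow, assuming NumPy and a synthetic dataset made up for illustration (the quadratic ground truth, the 60/20/20 split, and the candidate degrees 1 to 6 are my choices, not the course's). Polynomial degree stands in for "tweaking the model":

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: a noisy quadratic.
x = rng.uniform(-1.0, 1.0, 200)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, 0.3, 200)

# 60% train, 20% cv, 20% test.
x_train, y_train = x[:120], y[:120]
x_cv, y_cv = x[120:160], y[120:160]
x_test, y_test = x[160:], y[160:]

def cost(coeffs, x, y):
    # Squared-error cost J = (1 / 2m) * sum((prediction - y)^2).
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2) / 2.0)

candidates = range(1, 7)

# 1) Fit every candidate on the training set only.
models = {d: np.polyfit(x_train, y_train, d) for d in candidates}

# 2) Compare candidates on the cv set and keep the one with the lowest J_cv.
j_cv = {d: cost(models[d], x_cv, y_cv) for d in candidates}
best_d = min(j_cv, key=j_cv.get)

# 3) Touch the test set once, only to report the chosen model's cost.
j_test = cost(models[best_d], x_test, y_test)

print(f"chosen d = {best_d}, J_cv = {j_cv[best_d]:.4f}, J_test = {j_test:.4f}")
```

Since the test set played no part in fitting the parameters or in choosing d, J_test is a fair estimate of how the chosen model does on new data.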

Thank you so much for your reply. It is very straightforward. What I missed was that “tweaking the model” actually means “training” a new model. So after all the tweaking/training of all the models >> choose the best >> estimate Jtest.
much appreciated.

Hi all, I found this thread very helpful. I came here related to this very topic and there is something (at much higher level) that I don’t truly grasp.

Regarding how to choose the polynomial order to use to actually select a Model, there are different examples for it: d=1, d=2, d=3…
However, isn’t the number of parameters of the model (weights, bias) strictly related to the number of features in the dataset?

Let’s say your training data has 7 features. Wouldn’t you need to work with a d=7 model? Does this chapter talk about the fact that you might decide to implement a model which takes into account fewer features, therefore reducing the number of w, b parameters?

And of course, the other way around, if you are working with house price prediction, but your training set has only 1 feature (let’s say size), aren’t you limited in this case to d=1?

I think we are defining d as the order of the polynomial features. For example, for a set of 2 features, having d = 3 means the final set of features is x_1, x_2, x_1^2, x_1x_2, x_2^2, x_1^3, x_2^3, x_1x_2^2, x_1^2x_2.

Let me lay down some foundational points:

The value of d has nothing to do with the number of features we have in the original dataset. We can have 7 features and still set d = 1 or d = 10. At the beginning of my reply, I showed how we can have d = 3 when we have 2 features in the original dataset.

There is no rule that dictates what the value of d should be, given the number of features. Don’t look for such a rule.

It is our own job to determine what d should be, and experimentation can tell you whether d=1 is better than d=2.

The number of weights is related to the number of features in the final dataset. In the example at the beginning, where the initial number of features is 2 and d = 3, there will be 9 features in the final dataset. Consequently, a linear regression on that final dataset will have 9 weights, one attached to each of the final features.

If you have 2 features at the beginning, it is your decision whether to engineer polynomial features into your dataset. This is a very important point to keep in mind. It is a tool that we choose to use or not; it is not an apply-it-anyway kind of thing, and it is not a guarantee of anything.
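As a quick check on the counting, here is a small sketch. The helper name is made up, and it assumes the convention from my example above, where all cross terms up to degree d are included and the constant term is left out. Under that convention, the number of final features for n original features is C(n + d, d) - 1.

```python
from math import comb

def n_polynomial_features(n_features, d):
    # Number of monomials of total degree 1..d in n_features variables.
    # comb(n + d, d) counts degrees 0..d; subtract 1 to drop the constant.
    return comb(n_features + d, d) - 1

print(n_polynomial_features(2, 3))   # 2 original features, d = 3
print(n_polynomial_features(7, 1))   # 7 original features, d = 1
print(n_polynomial_features(7, 2))   # 7 original features, d = 2
print(n_polynomial_features(1, 5))   # 1 original feature,  d = 5
```

For 2 features and d = 3 this gives the 9 features listed earlier, and for a single feature it gives d features (x, x^2, ..., x^d). If you only take powers of each feature and skip the cross terms, the count would be n * d instead.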

Thanks a lot for taking the time to provide such a detailed answer, I really appreciate it! My train of thought went down a completely wrong path xD.

I understand better now. However, I think I missed where in the course Pr Andrew goes through this, or at least I didn’t understand how engineering polynomial features ties to the big picture.

Could you please point me to the chapter(s) where this is explained, or to any resource on this topic that you consider relatively easy for beginners to grasp?

The discussions about feature engineering and polynomial features are in Course 1 Week 2, and there is a lab about it too.

If I may, I would like to give you one more suggestion that I found very helpful during my learning.

To begin with, you probably wouldn’t find any lecture material that tells you that “d is not equal to the number of features”, and it takes some luck to come across a lecture example showing that d can be larger than the number of features.

But we can always have some misunderstandings when learning. I think this is how we learn; we don’t learn with 100% efficiency. There is nothing wrong with that. The thing is, it sometimes really takes luck to find a lecture or discussion that hits the spot. This can happen over and over again in the future.

My suggestion is, give it a try whenever possible, for example, with this method. We may create a simple dataset of 1 sample and 2 features using simple integers, and try whatever d value and see what happens next.
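As a rough illustration of that kind of experiment (a hand-rolled sketch using only the standard library, not necessarily the same as the linked method; the function name and the sample values 2 and 3 are made up), we can expand one sample with 2 features for a few values of d and look at what comes out:

```python
from itertools import combinations_with_replacement
from math import prod

def polynomial_features(sample, d):
    # Build every product of the original features with total degree 1..d.
    names, values = [], []
    for degree in range(1, d + 1):
        for idx in combinations_with_replacement(range(len(sample)), degree):
            names.append("*".join(f"x{i + 1}" for i in idx))
            values.append(prod(sample[i] for i in idx))
    return names, values

sample = [2, 3]   # one sample: x1 = 2, x2 = 3
for d in (1, 2, 3):
    names, values = polynomial_features(sample, d)
    print(f"d={d}: {len(values)} features")
    for name, value in zip(names, values):
        print(f"   {name} = {value}")
```

Using small integers makes it easy to check each generated feature by hand, for example x1*x2*x2 = 2 * 3 * 3 = 18.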

I know what I have said above is not always possible, and even when it is possible, sometimes we just do not know how to find a method like the one I have put a link to. However, giving that approach a try for at least 20 minutes every time you come across something will slowly make us more capable of making it work for us. For example, googling “how to practice polynomial feature engineering” will give us something to start with.