Degree of polynomial vs. regularization?

Hi,
Professor Ng clearly explains the importance of the bias and variance concepts when selecting a model for linear regression problems in week 3 of course 2 of the ML specialization.
He presents two options.

  1. The first is to fit a handful of models with different polynomial degrees, say from 1 to 10, and then choose the one with the lowest cross-validation error.
  2. The second is to fix a given polynomial degree, e.g. 4, and then choose the ‘just right’ regularization term by testing different values on the cross-validation set. (A rough sketch of both options follows this list.)
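
Here is a rough sketch of both options in scikit-learn. The toy data, the degree range, and the λ values below are just placeholders for illustration, not anything from the course:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# toy 1-D regression data, just to make the example runnable
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(x).ravel() + 0.3 * rng.standard_normal(200)
x_train, x_cv, y_train, y_cv = train_test_split(x, y, test_size=0.4, random_state=0)

# Option 1: sweep the polynomial degree and keep the one with the lowest cv error
cv_err_by_degree = {}
for degree in range(1, 11):
    model = make_pipeline(PolynomialFeatures(degree, include_bias=False),
                          StandardScaler(), LinearRegression())
    model.fit(x_train, y_train)
    cv_err_by_degree[degree] = mean_squared_error(y_cv, model.predict(x_cv))
best_degree = min(cv_err_by_degree, key=cv_err_by_degree.get)

# Option 2: fix the degree (say 4) and sweep the regularization strength lambda
cv_err_by_lambda = {}
for lam in [0.001, 0.01, 0.1, 1.0, 10.0]:
    model = make_pipeline(PolynomialFeatures(4, include_bias=False),
                          StandardScaler(), Ridge(alpha=lam))
    model.fit(x_train, y_train)
    cv_err_by_lambda[lam] = mean_squared_error(y_cv, model.predict(x_cv))
best_lambda = min(cv_err_by_lambda, key=cv_err_by_lambda.get)

print(best_degree, best_lambda)
```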

But the missing point is which method we should follow when choosing a model. Should I arbitrarily choose a polynomial degree for my model and then find a ‘just right’ regularization term? Or should I adopt the first approach and find the polynomial degree that neither overfits nor underfits?
(I am not even talking about complicating the problem further by adding more features to the model. That, I believe, will be dealt with in week 3 (decision trees), or I will need to study feature engineering on my own.)

Or maybe I should first apply method 1 to establish the best polynomial degree for the model, and then explore the best regularization term using method 2. The best of both worlds, as they say :wink:

Or, alternatively, call some magic function in scikit-learn to figure all of this out and hand me the best result :smiley:
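
Something close to that does exist: scikit-learn’s GridSearchCV can sweep the degree and the regularization strength together over a parameter grid. A rough sketch, again with placeholder data:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# toy data, only so the example runs end to end
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(x).ravel() + 0.3 * rng.standard_normal(200)

pipe = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("scale", StandardScaler()),
    ("ridge", Ridge()),
])
param_grid = {
    "poly__degree": list(range(1, 11)),
    "ridge__alpha": [0.001, 0.01, 0.1, 1.0, 10.0],
}
# 5-fold cross-validation over every (degree, alpha) combination
search = GridSearchCV(pipe, param_grid, scoring="neg_mean_squared_error", cv=5)
search.fit(x, y)
print(search.best_params_)
```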

Cheers, friends,

Mehmet

Hello Mehmet!

You have suggested a few ways, and let me summarize them:

  1. Arbitrarily choose a degree and then look for the best \lambda
  2. Just go through approach (1)
  3. Go through approach (1) and then add the regularization and look for the best \lambda

I think all three ways are reasonable. Look, there is no single best way to do it; the point is to make sure you end up with a model that performs best with respect to the cv dataset. Let’s discuss each of the three ways from the above:

Way no. 1

Let’s say we pick deg = 4, and then we train a model with regularization enabled, arbitrarily setting \lambda = 0.1. We train the model with the training set and evaluate it with the cv set. Then we ask the questions Andrew mentioned in the videos and conclude whether we are overfitting or underfitting. If we are overfitting, we have two choices: either reduce deg or increase \lambda. Note that both ways can give you a very good model, but at the end of the day, the decision is based on performance on the cv dataset.
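
A minimal sketch of this one-model-at-a-time loop (the toy data is made up; deg = 4 and λ = 0.1 are just the arbitrary starting point described above):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(x).ravel() + 0.3 * rng.standard_normal(200)
x_train, x_cv, y_train, y_cv = train_test_split(x, y, test_size=0.4, random_state=1)

# one model: deg = 4, lambda = 0.1
model = make_pipeline(PolynomialFeatures(4, include_bias=False),
                      StandardScaler(), Ridge(alpha=0.1))
model.fit(x_train, y_train)

train_mse = mean_squared_error(y_train, model.predict(x_train))
cv_mse = mean_squared_error(y_cv, model.predict(x_cv))
print(f"train MSE = {train_mse:.3f}, cv MSE = {cv_mse:.3f}")

# low train error but much higher cv error -> overfitting: reduce deg or increase lambda
# high train error (cv error similar)      -> underfitting: increase deg or decrease lambda
```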

Way no. 2

Let’s say we have results for deg from 1 to 10. We compare their performance on the same cv dataset and find deg = 2 to be the best. Then we ask ourselves the questions Andrew mentioned in the videos and conclude whether we are overfitting or underfitting. If we are overfitting, this time we enable regularization and arbitrarily try a \lambda > 0, and we see whether that brings us a better model with respect to the cv dataset. This time we won’t need to try a smaller deg, because we have already tried it.

Way no. 3

This is the same as Way no. 2, because in my discussion of Way no. 2 we do not stop at the best-degree model; we look into regularization if it overfits.

In summary, I want to tell you again that there is no single way of doing this. There is no best way; there is only the path you walked to get to your best model. You may walk this path and I may walk that path, and no one knows whether your final model or mine will do better until both of us deliver our models for a final comparison.

If you do a round of many trials (like Way no. 2), then you will spend more time waiting for those results before you can start asking the questions Andrew mentioned to decide whether it is overfitting or underfitting. If you train one model at a time (Way no. 1), you don’t need to wait that long. Waiting is the difference.

Cheers,
Raymond


Thank you, Raymond. I understand that there is a bit of trial and error, and even chance, in coming up with the best solution. Let’s say I follow Way no. 1 to get a speedy, good-enough result. I might reach a solution that is sub-optimal compared with Way no. 2, but it could still be good enough given the baseline level of performance.

Interesting! It seems like more experience gives more intuition into what will work best!

It is a very good point indeed; I forgot to mention it. With Way no. 1 we save a lot of time, but we also won’t get that “bigger picture” which Way no. 2 can give us. Very good point, Mehmet.

Exactly! I am very glad you said this. Sometimes it looks magical that some people can be so confident in adjusting those hyperparameters (like \lambda for regularization, or deg), but it’s all about experience and an understanding of the underlying mechanisms. Sometimes experience is hard to fully describe…

Raymond

Hi again,
As a follow-up to this discussion, I have some additions to the previous points as well as new questions about creating a ‘just right’ model.

  1. Andrew gives a good heuristic on how to arrive at a good model by a) adjusting the regularization term, b) adding more data, or c) simplifying or complicating the polynomial degree or the number of features, depending on whether the model has a high-variance or a high-bias problem. Here is the slide. (A small learning-curve sketch follows this list.)

  2. But then he introduces the NN framework and argues that with a large enough network you can easily solve the bias problem, and then regularize the model or add more data to address the problem of high variance. One caveat is the high computational cost, which makes the process very slow; the solution is hardware with powerful GPUs and plenty of RAM. Here are the relevant slides.
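
One way to probe the “add more data” part of that heuristic in code is a learning curve. A rough sketch, with a placeholder dataset and model (not the course’s own code):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(x).ravel() + 0.3 * rng.standard_normal(400)

model = make_pipeline(PolynomialFeatures(4, include_bias=False),
                      StandardScaler(), Ridge(alpha=0.1))

# train/cv error versus training-set size:
# cv error still falling as the set grows -> more data should help (high variance)
# both errors plateau at a high value     -> the model is too simple (high bias)
sizes, train_scores, cv_scores = learning_curve(
    model, x, y, train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5, scoring="neg_mean_squared_error")
print(sizes)
print(-train_scores.mean(axis=1))  # average train MSE at each size
print(-cv_scores.mean(axis=1))     # average cv MSE at each size
```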


Then here comes my question to Raymond, or anyone interested in this thread.
Why should we lose time adjusting the regularization term or the polynomial degree, or finding more data, for a simple ML model, when a complex enough NN model with a fine-tuned regularization term would do the work for us? Presuming, of course, that we have good enough hardware for the computationally expensive process.
So the usual question comes: why use a simple ML model with so many issues to consider, when we could easily come up with a good model using an NN?

I don’t think I have discovered a very novel idea, as I know people still use simple ML models for tabular data.

I am just asking: why not?

Mehmet

If you ask “why not”, then I feel safe telling you to just go with an NN.

Here are a few points that I want to make:

  1. I believe MLS introduces Linear Regression because (i) this is a Machine Learning course and not a Neural Network course, and, as you said, Linear Regression is a very useful method, and (ii) Linear Regression is a very good entrance not just to ML but also to NNs, because a 1-neuron NN without an activation function is just Linear Regression (see the toy check after this list), and concepts like regularization are common to both. So MLS introduces it, we learn it, and it is in our toolbox.

  2. An NN is no less work than Linear Regression (LR). You need to engineer features in LR; you can choose not to engineer features in an NN and instead give it a large enough network, but that means you need time to figure out how large is large enough, and you need to balance the size with good regularization. So you need time for trial and error with LR, and you also need time for trial and error with an NN.

  3. Experience.

    • Your experience in feature engineering can be brought over to engineer features for the NN too. Feature engineering isn’t necessary given a large enough NN, but engineering features might help you avoid a very large NN, so if you are experienced in feature engineering, you are helping yourself.
    • You also need experience with NNs to tune them well. Knowing LR well won’t exempt you from practicing NNs from the ground up. Also, there are so many creative ideas in NNs that will ask even more time of you to learn and practice.
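
To make point (ii) concrete, here is a toy check (assuming TensorFlow/Keras and scikit-learn; the data is made up): a single Dense neuron with no activation recovers essentially the same weights as LinearRegression.

```python
import numpy as np
import tensorflow as tf
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=(500, 2)).astype("float32")
y = (3.0 * x[:, 0] - 2.0 * x[:, 1] + 1.0
     + 0.1 * rng.standard_normal(500)).astype("float32")

# a 1-neuron "network": one Dense unit, no activation function
nn = tf.keras.Sequential([tf.keras.layers.Dense(1)])
nn.compile(optimizer=tf.keras.optimizers.Adam(0.1), loss="mse")
nn.fit(x, y, epochs=200, verbose=0)

lr = LinearRegression().fit(x, y)

w, b = nn.layers[0].get_weights()
print(w.ravel(), b)             # should be close to [3, -2] and 1
print(lr.coef_, lr.intercept_)  # essentially the same numbers
```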

However, I am not suggesting you stay with LR; if you are ready, perhaps you can start to try NNs and let us know whatever you find out :stuck_out_tongue:

Cheers,
Raymond

I see. Feature engineering is useful for both NNs and LR, as you underline. But the good point is that you can discard it with an NN, provided you have a large enough network, which is itself something you need to figure out.

After watching Andrew’s videos, I just think that an NN architecture needs less fine-tuning effort than an LR model, and that seems to be what Andrew implies at the end of the lecture. But we have a saying in Turkish: ‘the sound of the drum is pleasant from afar’. So maybe I should just approach the drum to get the real feel :slight_smile: i.e. practice!

You know what, Andrew is very experienced. :grin: :grin: So, yes, get the real feel, and then you can tell. One day you may say the same thing!

Do not neglect that training a logistic regression requires much less computation than training an NN. NNs can be hundreds of times more expensive than logistic regression, due to the calculations required for backpropagation.

Depending on the problem you’re trying to solve, and the computing platform you are using, an NN may not be the best solution.

But if the dataset is not very large and you have good hardware, wouldn’t using an NN give a more accurate (and more easily formulated) model for a log regression problem?

I’m not sure exactly what you mean by “log regression”.

NNs are very effective, but they’re not cheap or easy to train. I’m just suggesting there can be value in using the most efficient tool for each specific task.

I meant ‘logistic regression’, but I get your point. Thank you very much for your explanation.

Hi everyone,
I think this topic and the points mentioned are very interesting, and I would like to add another question/point of view if I may :slight_smile:

In the beginning, the discussion was about model complexity (e.g. low/high-order polynomial) vs. regularization and which to use over the other. The conclusion was that there is no “just one way” to do things and both approaches can give the best results.

Next I will speak only about neural networks:
In the last assignment of Course 2, week 3, we compared a simple and a complex neural network and found that the simple one was doing as well as the complex one (without regularization), because the complex one was overfitting. Of course, I assume this is highly case-dependent.

My question is: why should I use a complex NN with regularization if I could use a simpler NN with little or no regularization? My point is that simpler NNs compute much faster. So why build a complex NN and then use regularization to make it simpler again, if I can just go with a simpler NN that is faster to compute?
So should I prefer a slightly less complex model with less regularization over a more complex model with more regularization? Is there a trade-off or downside to it?

Hello @M_R2,

A very good question! Here is my opinion:

At the end of the day, we want a model that generalizes very well to unseen data. This goal governs everything.

If I end up finding a complex model with regularization that is NOT doing better (with respect to the above goal) than a simpler model, then I will go for the simpler model.
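
As a toy illustration of that comparison (assuming TensorFlow/Keras; the data, layer sizes, and λ below are arbitrary placeholders): train both candidates and keep whichever does better on the cv set.

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
x = rng.uniform(-3, 3, size=(1000, 2)).astype("float32")
y = (np.sin(x[:, 0]) * np.cos(x[:, 1])
     + 0.2 * rng.standard_normal(1000)).astype("float32")
x_train, x_cv, y_train, y_cv = train_test_split(x, y, test_size=0.3, random_state=4)

def build_model(units, lam):
    """units controls the complexity, lam is the L2 regularization strength."""
    reg = tf.keras.regularizers.l2(lam) if lam > 0 else None
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(units, activation="relu", kernel_regularizer=reg),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

candidates = {
    "simple, no regularization": build_model(8, 0.0),
    "complex, L2 = 0.01": build_model(128, 0.01),
}
for name, model in candidates.items():
    model.fit(x_train, y_train, epochs=100, verbose=0)
    cv_mse = model.evaluate(x_cv, y_cv, verbose=0)
    print(name, cv_mse)  # keep whichever generalizes better on the cv set
```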

What a complex model can bring us is more neurons, more combinations of neurons, and the CHANCE that, after training, they express a transformation that better describes the underlying, true transformation. Moreover, we also need to think about how we add those extra neurons. Do we just keep appending more Dense layers? Or do we have a strategy for how to regularize or use those layers? For example, in U-Net we have those grey arrows (called skip connections) that bring something over from earlier layers to later layers.

In the Inception network, we have those “extra branches” (in green circles) to regulate the outputs from the layers in the middle.

In a GRU, there is a “memory cell” which is designed to pass things over from previous time steps to later time steps.

In ResNet, we see skip connections again, but there they are used to help learn the “residuals”.


(Please watch this video to find out more about the two graphs at the bottom of the above slide. You may skip to 5:00 or, better, watch the whole thing.)

So it is not just that a complex model gives more possibilities; it is a well-designed complex model, I would say, where the extra parts are there on purpose, that brings better results.
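
To make the “skip connection” idea a bit more concrete, here is a toy fully-connected residual block in Keras. This is only a sketch of the general pattern, not the actual U-Net or ResNet code:

```python
import tensorflow as tf

def residual_block(x, units):
    """output = relu(Dense(Dense(x)) + x) -- the '+ x' is the skip connection."""
    shortcut = x
    h = tf.keras.layers.Dense(units, activation="relu")(x)
    h = tf.keras.layers.Dense(units)(h)        # no activation before the add
    h = tf.keras.layers.Add()([h, shortcut])   # bring the input back in
    return tf.keras.layers.Activation("relu")(h)

inputs = tf.keras.Input(shape=(16,))
h = tf.keras.layers.Dense(16, activation="relu")(inputs)
h = residual_block(h, 16)                      # units must match the input width
outputs = tf.keras.layers.Dense(1)(h)
model = tf.keras.Model(inputs, outputs)
model.summary()
```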

Cheers,
Raymond


Thank you for your detailed answer and the examples. I have not touched on those yet, but it was very helpful for understanding :slight_smile:

Hi M_R2,
As far as I remember, Andrew suggests that if you decide to use an NN architecture, a complex model would be the better option, as it eliminates the bias problem. Then you just need to regularize it to save it from overfitting. Yet the question remains, as Raymond posed in a previous post: ‘how many neurons and layers do we need for a complex model?’ His response to your question made me think that there is definitely more to NNs than Course 2 deals with, and we should continue learning by taking the DL specialization. Course 2 was just a good start.

Warmly, Mehmet

Yes, please just keep in mind that there are such things, and when the time comes, you will connect all the dots. :wink:

You will get to use the TensorFlow framework in this course, so please do try copying and pasting a few more Dense layers and adjusting some regularization parameters to see what happens. Be bold when adding layers. (A small starting point is sketched below.)
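
Here is a small starting point for that experiment. The layer sizes and λ are arbitrary, and x_train, y_train, x_cv, y_cv stand for whatever dataset you happen to be working with:

```python
import tensorflow as tf

lam = 0.01  # try a few values and compare the cv error each time
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(lam)),
    tf.keras.layers.Dense(64, activation="relu",  # copy/paste more of these to go deeper
                          kernel_regularizer=tf.keras.regularizers.l2(lam)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
# model.fit(x_train, y_train, epochs=50, validation_data=(x_cv, y_cv))
```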

Cheers,
Raymond
