# Problem understanding overfitting

Hello, there is something I don’t understand in the explanations about ‘overfitting’.

I don’t understand how we can get a polynomial expression from the gradient descent algorithm. As I understand it, gradient descent applied to a case with only one variable would only find the values of w and b.
If I’m not wrong, we would never get anything other than a straight line.

In the video called ‘Addressing overfitting’, a polynomial expression of degree 4 is shown. However, I do not understand how we can arrive at such a result with the algorithm we have been taught (gradient descent); we would always get a polynomial of degree 1.

Regards,

Hello @Thrasso00,

There are two steps to understanding this. First, we need multiple linear regression, which is covered in Course 1 Week 2 and is about modeling with more than one feature.

Second, given that linear regression can take more than one feature, even if we start with only one feature called x, we can manually create a second feature x^2 by squaring our feature x, and similarly a third, fourth, and fifth feature by calculating x^3, x^4, and x^5 respectively.

In this way, we have our original x and the self-created x^2, x^3, x^4, and x^5, and we use all five of them in a multiple linear regression: y = b + w_1x + w_2x^2 + w_3x^3 + w_4x^4 + w_5x^5, or equivalently y = b + w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 + w_5x_5 where x_n = x^n.
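A minimal NumPy sketch of this idea (my own toy data and a plain batch-gradient-descent loop, not code from the course): we build x^2 through x^5 by hand, then let gradient descent fit only the linear parameters w and b.

```python
import numpy as np

# Toy data: one original feature x and a nonlinear target (assumed for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 0.5 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.05, size=50)

# Manually create the polynomial features x, x^2, ..., x^5.
X = np.column_stack([x**n for n in range(1, 6)])  # shape (50, 5)

# Ordinary multiple linear regression trained by batch gradient descent.
w = np.zeros(X.shape[1])
b = 0.0
alpha = 0.1
for _ in range(20000):
    err = X @ w + b - y          # prediction error
    w -= alpha * (X.T @ err) / len(y)
    b -= alpha * err.mean()

# Gradient descent itself only updated w and b linearly; the curved shape
# y = b + w_1 x + ... + w_5 x^5 comes entirely from the features we created.
```

Note that the training loop never "discovers" the polynomial; it just finds weights for whatever columns of X we hand it.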

Review Week 2 if you are unsure about multiple linear regression.

Cheers,
Raymond

The key thing to understand from Raymond’s explanation is that x, x^2, x^3, and x^4 as features DID NOT come about as a result of gradient descent. These features were selected or created by us and then fed into the gradient descent learning algorithm.

Given these features, gradient descent then finds the optimal values of w1, w2, ...wn that serve as the coefficients for these features. If we choose to provide a different set of features with another combination of polynomial terms, then gradient descent will find the weights for that new set of features.

So, the blame is on us for choosing that set of polynomial degrees as the features, which led to the wiggly-shaped curve.
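This point can be seen by running one and the same gradient-descent loop on two feature sets we choose ourselves (a sketch with made-up data; the `fit` helper is hypothetical, not from the course): given only x it can only produce a line, given x through x^4 it produces a curve.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=40)
y = np.sin(3 * x) + rng.normal(scale=0.05, size=40)  # clearly nonlinear target

def fit(X, y, alpha=0.1, iters=20000):
    """Plain batch gradient descent for multiple linear regression."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        err = X @ w + b - y
        w -= alpha * (X.T @ err) / len(y)
        b -= alpha * err.mean()
    return w, b

# Same algorithm, two different feature sets chosen by us:
w1, b1 = fit(x[:, None], y)                                     # degree 1: a line
w4, b4 = fit(np.column_stack([x**n for n in range(1, 5)]), y)   # degrees 1..4: a curve
```

The degree-4 model fits this data much more closely, and that difference is entirely down to the features we supplied, not to anything gradient descent did differently.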

Hello @rmwkwok , @shanup ,

I have understood from the explanations given in Week 2 that we have to be the ones to choose the new features so that the model fits the data better.

There are still two things that are not clear to me:

1. Is this feature engineering something we do instinctively, or are there tools to help us?
2. Is there a demonstration of where the expression comes from?

Regards,

Hello @Thrasso00