Good day everyone, I am currently taking the first Machine Learning Specialization course and I have a question, though it might sound funny. Suppose you are using gradient descent to optimize the cost function for a regression problem, and the cost is not decreasing after hundreds of iterations but is going up instead. Apart from the usual fix of reducing the learning rate, could the cause be that the wrong regression model underfits the data? For example, maybe the data points lie on a quadratic curve, so I should have used a second order polynomial regression model, but I used a simple straight-line model instead. Can this stop the cost from being minimized and prevent gradient descent from reaching the global minimum? Thank you.
Hello @emmanuel_Obolo,
So you are asking whether fitting a first order model to second order data will make gradient descent fail to reach the minimum. The answer is no, it won’t fail.

1. A linear regression model y = w_1x_1 + w_2x_2 + ... with squared loss guarantees you only one minimum, which is the global minimum. You can reach it if you have a small enough learning rate and a large enough number of iterations. This is independent of your dataset and your choice of features (be they linear or polynomial).

2. The cost space is formed by 3 things: the loss function, the data, and the model assumption. This means that we get 2 different cost spaces when we have (A) squared loss + second order data + first order model, and (B) squared loss + second order data + second order model.

3. Since they are 2 different cost spaces, their global minima are different and carry different minimum cost values. Obviously, the global minimum of B is “cheaper” than that of A.

4. Because A and B both satisfy the condition set out in point (1), they will both reach their respective global minima with a good learning rate and number of iterations.
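To make the comparison between A and B concrete, here is a minimal sketch. The synthetic quadratic data, the learning rate, the iteration count, and the helper name `gradient_descent` are all assumptions picked for illustration, not anything from the course labs. It runs plain batch gradient descent on both models and compares the result with the closed-form least-squares solution.

```python
import numpy as np

def gradient_descent(X, y, lr=0.5, n_iters=2000):
    """Batch gradient descent on the squared-error cost J(w) = mean((Xw - y)^2) / 2."""
    w = np.zeros(X.shape[1])
    costs = []
    for _ in range(n_iters):
        residual = X @ w - y
        costs.append(0.5 * np.mean(residual ** 2))   # cost before this update
        w -= lr * (X.T @ residual) / len(y)          # gradient step
    return w, np.array(costs)

# Assumed toy data: second order (quadratic) in x, plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 1.0 + 2.0 * x + 3.0 * x ** 2 + rng.normal(0.0, 0.1, size=200)

X_a = np.column_stack([np.ones_like(x), x])           # (A) first order model
X_b = np.column_stack([np.ones_like(x), x, x ** 2])   # (B) second order model

w_a, costs_a = gradient_descent(X_a, y)
w_b, costs_b = gradient_descent(X_b, y)

# Closed-form least-squares solutions, used only as a reference for the global minimum.
w_a_star = np.linalg.lstsq(X_a, y, rcond=None)[0]
w_b_star = np.linalg.lstsq(X_b, y, rcond=None)[0]

print("A: final cost", costs_a[-1], "reached global minimum:", np.allclose(w_a, w_a_star, atol=1e-6))
print("B: final cost", costs_b[-1], "reached global minimum:", np.allclose(w_b, w_b_star, atol=1e-6))
```

In this setup both runs settle at their own global minimum. Model A underfits, so its minimum cost is higher than B's, but its cost never goes up along the way.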
If the cost stops decreasing and turns up after 100 iterations, I will first make sure I have normalized my features. If the problem remains, I will suspect that my learning rate is too large, and I will verify that guess by dividing it by 10, 100, or more, depending on what my current value is.
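Here is a rough sketch of that debugging routine. The feature values (an unscaled "size" column), the learning rates, and the helper name `run_gd` are assumptions chosen only to show the effect. The run is kept to 20 iterations so the diverging cost stays within floating-point range.

```python
import numpy as np

def run_gd(X, y, lr, n_iters=20):
    """Return the squared-error cost recorded before each gradient descent update."""
    w = np.zeros(X.shape[1])
    costs = []
    for _ in range(n_iters):
        residual = X @ w - y
        costs.append(0.5 * np.mean(residual ** 2))
        w -= lr * (X.T @ residual) / len(y)
    return np.array(costs)

# Assumed toy data: one feature with a large, unscaled range.
rng = np.random.default_rng(1)
size = rng.uniform(500.0, 3000.0, size=100)
price = 50.0 + 0.1 * size + rng.normal(0.0, 5.0, size=100)

X_raw = np.column_stack([np.ones_like(size), size])

# z-score normalization of the feature
size_norm = (size - size.mean()) / size.std()
X_norm = np.column_stack([np.ones_like(size), size_norm])

costs_raw = run_gd(X_raw, price, lr=0.1)      # unscaled feature, same learning rate
costs_norm = run_gd(X_norm, price, lr=0.1)    # normalized feature, same learning rate
costs_small = run_gd(X_raw, price, lr=1e-7)   # unscaled feature, much smaller learning rate

print("raw,  lr=0.1 :", costs_raw[0], "->", costs_raw[-1])
print("norm, lr=0.1 :", costs_norm[0], "->", costs_norm[-1])
print("raw,  lr=1e-7:", costs_small[0], "->", costs_small[-1])
```

With the unscaled feature and lr=0.1 the cost grows at every step. Either normalizing the feature or shrinking the learning rate by several orders of magnitude makes the cost go down again; the model itself is the same first order model in all three runs.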
Cheers,
Raymond
What I am asking is this: if the data calls for a higher order polynomial and I use just a straight-line linear regression instead, can this keep my cost function from decreasing and keep gradient descent from converging? We all know that using that model will cause underfitting.
Hi @emmanuel_Obolo ,
When using a higher-order polynomial with linear regression, you will certainly have more local minima, as well as a global minimum. The exact shape of the cost function will depend on the specific data that is being fit, as well as the degree of the polynomial and the values of the coefficients.
In simple systems where you have very few dimensions, you would use full gradient descent, which has a simpler update formula. In these simple models with 1-2 dimensions, you can get stuck in local minima that are traps. It may happen.
Then we have the more complex models, which involve perhaps millions of parameters. For these we would use Stochastic Gradient Descent (SGD) (and there’s another possible method called mini-batch GD).
Stochastic Gradient Descent (SGD), which is mainly used in complex NN, is unlikely to get stuck in local minima because by nature it is very noisy. This noisiness may sometimes allow it to skip local minima. So you would say “hmm, is it a matter of luck?”
Well, the real reason why NN can be optimized is that there aren’t that many local minima that are ‘traps’. The complex NN are built in such a way that the parameter space has such high dimensions that there are hardly any local minima.
When we humans imagine the functions in a graph, we usually think in 2D or maybe 3D, and we can arguably say that there are high chances of local minima traps, as discussed in simpler models that use full gradient descent.
In 3D, however, we may start gaining intuition that trap local minima are rare. You’ll usually find the form of a saddle, where the apparent local minimum can actually continue descending along one of the sides. From this intuition, try to imagine a complex neural network. These NN create such complex multi-dimensional spaces (perhaps millions of dimensions or more, which we cannot visualize) that local minima traps are rare.
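To illustrate the saddle intuition, here is a small sketch. The toy function f(x, y) = x^2 + y^4/4 - y^2/2, the starting point, and the learning rate are my own assumptions for illustration. It has a saddle at (0, 0) and minima at (0, 1) and (0, -1). Along the x direction the origin looks like a minimum, but gradient descent started slightly off the x axis drifts away along y and keeps descending.

```python
import numpy as np

def f(p):
    """Toy surface with a saddle at (0, 0) and minima at (0, 1) and (0, -1)."""
    x, y = p
    return x ** 2 + y ** 4 / 4 - y ** 2 / 2

def grad_f(p):
    x, y = p
    return np.array([2 * x, y ** 3 - y])

p = np.array([1.0, 1e-3])   # start almost on the x axis, which leads toward the saddle
lr = 0.1
values = []
for _ in range(500):
    values.append(f(p))
    p = p - lr * grad_f(p)   # plain gradient descent step

saddle_value = f(np.array([0.0, 0.0]))
print("final point:", p)
print("final value:", f(p), "value at the saddle:", saddle_value)
```

The iterate first approaches the saddle, where the gradient is small and progress looks stalled, and then slides off along y to a point with a lower value than the saddle.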
There is also a great answer on this topic HERE, provided not long ago by @paulinpaloalto. I invite you to read it to get more intuition on this matter.
Thoughts?
Juan
Hello @emmanuel_Obolo,
I think point number 1 in my first reply has answered your question. Using the lower order model does not cause divergence.
As for what can cause divergence, the last paragraph of that reply covers it: if the learning rate is too large, or the features are not all normalized, then it can diverge.
Points 2, 3, and 4 discuss the difference between a first order model and a higher order model. The two are different, just not in whether the cost will diverge.
I also recommend that you read the link shared by @Juan_Olano.
Cheers,
Raymond