In some 3D plots of the cost function there can be multiple local minima.
If our goal is to minimize the cost function, I assume that means we want to end up at the bottom of the deepest local minimum. But how do we do that? As we begin our gradient descent, how do we ensure that we are heading towards the deepest valley?
If you're using linear regression or logistic regression, those cost functions are known to be convex. They have only one minimum.
For more complex systems that use neural networks, those can have local minima. In that case we're often satisfied to find a "good enough" solution, rather than try to find the solution that has the lowest minimum (if the lowest minimum doesn't give significantly better predictions than the "good enough" one). It depends on what problem you're solving and what your goals are.
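Here is a small sketch of the difference (the functions and numbers are my own made-up illustration, not from the course): for a convex cost, gradient descent reaches the same minimum from any starting point, while for a non-convex cost the starting point decides which valley you end up in.

```python
import numpy as np

# Toy data for 1-parameter linear regression: y = 2*x, so the true w is 2.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

def mse_cost_grad(w):
    """Convex MSE cost J(w) = (1/2m) * sum((w*x - y)^2) and its gradient."""
    m = len(x)
    err = w * x - y
    return (err @ err) / (2 * m), (err @ x) / m

def bumpy_cost_grad(w):
    """A made-up non-convex cost with several local minima (illustration only)."""
    return w**2 + 10 * np.sin(w), 2 * w + 10 * np.cos(w)

def gradient_descent(cost_grad, w0, alpha=0.01, steps=2000):
    w = w0
    for _ in range(steps):
        _, grad = cost_grad(w)
        w -= alpha * grad
    return w

for w0 in (-8.0, 0.0, 8.0):
    print("convex MSE, start", w0, "->", round(gradient_descent(mse_cost_grad, w0), 3))
for w0 in (-8.0, 0.0, 8.0):
    print("bumpy cost, start", w0, "->", round(gradient_descent(bumpy_cost_grad, w0), 3))
```

For the convex cost, every start ends at w = 2. For the bumpy cost, starting at 8 gets trapped in a shallower valley (around w ≈ 3.8) instead of the deeper one (around w ≈ -1.3).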
Keep in mind there is a huge difference between the shape of the predicted values f_wb, and the shape of the cost curve.
So this means that for neural networks, it's always possible that there is a solution which is better than the one that we accept as "good enough"?
Also, I feel like I'm missing something. A 3D plot of the cost function basically shows me where the cost (the difference between y-hat and y) is at its minimum. So why can't I just generate that plot and pick the lowest point? Why go through the bother of the gradient calculations?
Is it because such an approach would be impractical at scale?
Possible, yes. But "always" is seldom a true statement when it comes to machine learning.
If you have a data set with several hundred features, you now have a plot with several hundred dimensions. This can't be visualized.
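And even if you could, "pick the lowest point of the plot" amounts to evaluating the cost on a grid over all of the parameters, which grows exponentially with the number of parameters. A quick back-of-the-envelope sketch (the grid resolution of 100 points per parameter is an arbitrary assumption of mine):

```python
import math

points_per_dim = 100  # arbitrary grid resolution per parameter

for n_params in (1, 2, 3, 10, 300):
    # total grid points = points_per_dim ** n_params; report it as a power of 10
    log10_evals = n_params * math.log10(points_per_dim)
    print(f"{n_params:>3} parameter(s): about 10^{log10_evals:.0f} cost evaluations")
```

With 300 parameters that is about 10^600 cost evaluations, which is hopeless. Gradient descent, by contrast, only needs a modest number of iterations, and each iteration's cost scales with the size of the data set and the number of parameters.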
Let me rephrase:
Whenever we find a solution that is good enough, there is still the possibility that a better solution exists.
By the way, I really appreciate that you and the other mentors take the time to mentor us in this course.
Yes, that's true. But if that matters, then the definition of "good enough" wasn't sufficient.
Hello @JPGuittard! Our ideal solution is to reach the deepest local minimum (which is called the global minimum). We tune hyperparameters for that, but sometimes we get stuck in a local minimum and can't reach the global minimum.
Let me elaborate. I work as a petroleum engineer. When we produce oil from a reservoir, we do not produce 100% of it. We use our available resources and technology to produce as much as we can at a profit. But when producing more oil takes more resources and gives little or no profit, we do not bother to produce it. We just shut in the well and find a new reservoir. This same intuition can be applied to any business, right? For example, an airplane might be able to fly at 1500 or 2000 km/h, but airlines fly at around 1000 km/h because that is more profitable.
So, when we have trained a good model and accept it, there is still a chance to take one more step. But what will it cost? How much time will it take? What will be the return? If human accuracy is 94%, a good model's accuracy is 95%, and an excellent model's accuracy is 98% but takes double the resources, what will you do?
Best,
Saif.
Great question, @JPGuittard!
In addition to the excellent answers from @TMosh and @saifkhanengr: in general, we should also consider the optimization metric and the characteristics of the data which influence the characteristics of the cost function.
E.g., when using MSE for logistic regression (which is not recommended), the cost function is not convex; see also: Why not Mean Squared Error (MSE) as a loss function for Logistic Regression? | by Rajesh Shreedhar Bhat | Towards Data Science
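To make that concrete, here is a small numerical sketch (my own toy setup: a single training example with x = 1 and y = 1). A convex function must satisfy the midpoint inequality f((a+b)/2) ≤ (f(a)+f(b))/2 for every a and b; the MSE-of-sigmoid loss violates it, while the cross-entropy loss passes this particular check (and is in fact convex).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Losses as a function of the single weight w, for one example with x = 1, y = 1
def mse_loss(w):
    return (sigmoid(w) - 1.0) ** 2

def cross_entropy_loss(w):
    return -np.log(sigmoid(w))

# Convexity requires f((a+b)/2) <= (f(a) + f(b)) / 2 for all a, b.
a, b = -10.0, 0.0
for name, f in [("MSE", mse_loss), ("cross-entropy", cross_entropy_loss)]:
    mid = f((a + b) / 2)
    chord = (f(a) + f(b)) / 2
    print(f"{name:>13}: f(midpoint) = {mid:.3f}, chord average = {chord:.3f}, "
          f"convexity violated: {mid > chord}")
```

Note that one passing check does not prove the cross-entropy loss is convex; it simply fails to find a counterexample, whereas the MSE loss is caught immediately.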
Best regards
Christian
Just to be clear: Do not use MSE for logistic regression.
Yeah! Regarding that recommendation:
this refers to using the MSE between the evaluated probability function and the labels as the metric for loss computation in a classification problem. Here it is better to use cross entropy to compute the loss, which corresponds to maximizing the likelihood of the observations; see also: LogisticRegression — scikit-learn 1.6.1 documentation
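A minimal usage sketch (my own synthetic data, chosen arbitrarily): scikit-learn's LogisticRegression minimizes the regularized cross-entropy / log loss, and the same metric can be computed on the predicted probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Small synthetic binary classification problem
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# LogisticRegression fits by minimizing the (regularized) cross-entropy / log loss
clf = LogisticRegression().fit(X, y)

# Evaluate the same cross-entropy on the training predictions
print("training log loss:", log_loss(y, clf.predict_proba(X)))
```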
Side note: for sparse data sets there is still some ongoing research in this field, especially if one is primarily interested in the MSE of the coefficients. On that note, this article might be worth a read if someone is interested:
For finite samples with binary outcomes penalized logistic regression such as ridge logistic regression has the potential of achieving smaller mean squared errors (MSE) of coefficients and predictions than maximum likelihood estimation.
Have a good one!
Best regards
Christian