In some 3D plots of the cost function there can be multiple local minima.
If our goal is to minimize the cost function, I assume that means we want to end up at the bottom of the deepest local minimum. But how do we do that? As we begin our gradient descent, how do we ensure that we are heading towards the deepest valley?
If you're using linear regression or logistic regression, those cost functions are known to be convex. They have only one minimum.
For more complex systems that use neural networks, those can have local minima. In that case we're often satisfied to find a "good enough" solution, rather than try to find the solution that has the lowest minimum (if the lowest minimum doesn't give significantly better predictions than the "good enough" one). It depends on what problem you're solving and what your goals are.
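Here is a small sketch of the difference (the functions and numbers are my own made-up illustration, not from the course): for a convex cost, gradient descent reaches the same minimum from any starting point, while for a non-convex cost the starting point decides which valley you end up in.

```python
import numpy as np

# Toy data for 1-parameter linear regression: y = 2*x, so the true w is 2.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

def mse_cost_grad(w):
    """Convex MSE cost J(w) = (1/2m) * sum((w*x - y)^2) and its gradient."""
    m = len(x)
    err = w * x - y
    return (err @ err) / (2 * m), (err @ x) / m

def bumpy_cost_grad(w):
    """A made-up non-convex cost with several local minima (illustration only)."""
    return w**2 + 10 * np.sin(w), 2 * w + 10 * np.cos(w)

def gradient_descent(cost_grad, w0, alpha=0.01, steps=2000):
    w = w0
    for _ in range(steps):
        _, grad = cost_grad(w)
        w -= alpha * grad
    return w

for w0 in (-8.0, 0.0, 8.0):
    print("convex MSE, start", w0, "->", round(gradient_descent(mse_cost_grad, w0), 3))
for w0 in (-8.0, 0.0, 8.0):
    print("bumpy cost, start", w0, "->", round(gradient_descent(bumpy_cost_grad, w0), 3))
```

For the convex cost, every start ends at w = 2. For the bumpy cost, starting at 8 gets trapped in a shallower valley (around w ≈ 3.8) instead of the deeper one (around w ≈ -1.3).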
Keep in mind there is a huge difference between the shape of the predicted values f_wb, and the shape of the cost curve.
So this means that for neural networks, it's always possible that there is a solution which is better than the one that we accept as "good enough"?
Also, I feel like I'm missing something. A 3D plot of the cost function basically shows me where the cost (the difference between y-hat and y) is at its minimum. So why can't I just generate that plot and pick the lowest point? Why go through the bother of the gradient calculations?
Is it because such an approach would be impractical at scale?
Possible, yes. But "always" is seldom a true statement when it comes to machine learning.
If you have a data set with several hundred features, you now have a plot with several hundred dimensions. This can't be visualized.
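And even if you could, "pick the lowest point of the plot" amounts to evaluating the cost on a grid over all of the parameters, which grows exponentially with the number of parameters. A quick back-of-the-envelope sketch (the grid resolution of 100 points per parameter is an arbitrary assumption of mine):

```python
import math

points_per_dim = 100  # arbitrary grid resolution per parameter

for n_params in (1, 2, 3, 10, 300):
    # total grid points = points_per_dim ** n_params; report it as a power of 10
    log10_evals = n_params * math.log10(points_per_dim)
    print(f"{n_params:>3} parameter(s): about 10^{log10_evals:.0f} cost evaluations")
```

With 300 parameters that is about 10^600 cost evaluations, which is hopeless. Gradient descent, by contrast, only needs a modest number of iterations, and each iteration's cost scales with the size of the data set and the number of parameters.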
Let me rephrase:
Whenever we find a solution that is good enough, there is still the possibility that a better solution exists.
By the way, I really appreciate that you and the other mentors take the time to mentor us in this course.
Yes, that's true. But if that matters, then the definition of "good enough" wasn't sufficient.
Hello @JPGuittard! Our ideal solution is to reach the deepest local minimum (which is called the global minimum). We tune hyperparameters for that, but sometimes we get stuck in a local minimum and can't reach the global minimum.
Let me elaborate. I work as a petroleum engineer. When we produce oil from a reservoir, we do not produce 100% of it. We use our available resources and technology to produce as much as we can at a profit. But when producing more oil takes more resources and gives little or no profit, we do not bother to produce it. We just shut in the well and find a new reservoir. This same intuition can be applied to any business, right? For example, an airplane might be able to fly at 1500 or 2000 km/h, but airlines fly at around 1000 km/h because that is more profitable.
So, when we have trained a good model and accept it, there is still a chance to take one more step. But what will it cost? How much time will it take? What will be the return? If human accuracy is 94%, a good model's accuracy is 95%, and an excellent model's accuracy is 98% but takes double the resources, what will you do?
Best,
Saif.
Great question, @JPGuittard!
In addition to the excellent answers from @TMosh and @saifkhanengr: in general, we should also consider the optimization metric and the characteristics of the data which influence the characteristics of the cost function.
E.g., when using MSE for logistic regression (which is not recommended), the cost function is not convex; see also: Why not Mean Squared Error (MSE) as a loss function for Logistic Regression? | by Rajesh Shreedhar Bhat | Towards Data Science
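To make that concrete, here is a small numerical sketch (my own toy setup: a single training example with x = 1 and y = 1). A convex function must satisfy the midpoint inequality f((a+b)/2) ≤ (f(a)+f(b))/2 for every a and b; the MSE-of-sigmoid loss violates it, while the cross-entropy loss passes this particular check (and is in fact convex).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Losses as a function of the single weight w, for one example with x = 1, y = 1
def mse_loss(w):
    return (sigmoid(w) - 1.0) ** 2

def cross_entropy_loss(w):
    return -np.log(sigmoid(w))

# Convexity requires f((a+b)/2) <= (f(a) + f(b)) / 2 for all a, b.
a, b = -10.0, 0.0
for name, f in [("MSE", mse_loss), ("cross-entropy", cross_entropy_loss)]:
    mid = f((a + b) / 2)
    chord = (f(a) + f(b)) / 2
    print(f"{name:>13}: f(midpoint) = {mid:.3f}, chord average = {chord:.3f}, "
          f"convexity violated: {mid > chord}")
```

Note that one passing check does not prove the cross-entropy loss is convex; it simply fails to find a counterexample, whereas the MSE loss is caught immediately.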
Best regards
Christian
Just to be clear: Do not use MSE for logistic regression.
Yeah! Regarding that recommendation:
this refers to using the MSE between the evaluated probability function and the labels as the metric for loss computation in a classification problem. Here it is better to use cross entropy to compute the loss, which corresponds to maximizing the likelihood of the observations; see also: LogisticRegression — scikit-learn 1.6.1 documentation
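A minimal usage sketch (my own synthetic data, chosen arbitrarily): scikit-learn's LogisticRegression minimizes the regularized cross-entropy / log loss, and the same metric can be computed on the predicted probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Small synthetic binary classification problem
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# LogisticRegression fits by minimizing the (regularized) cross-entropy / log loss
clf = LogisticRegression().fit(X, y)

# Evaluate the same cross-entropy on the training predictions
print("training log loss:", log_loss(y, clf.predict_proba(X)))
```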
Side note: for sparse data sets there is still some ongoing research in this field, especially if one is primarily interested in the MSE of the coefficients. On that note, this article might be worth a read if someone is interested:
For finite samples with binary outcomes penalized logistic regression such as ridge logistic regression has the potential of achieving smaller mean squared errors (MSE) of coefficients and predictions than maximum likelihood estimation.
Have a good one!
Best regards
Christian