It is mentioned that we do not use the error cost function the way we defined it in linear regression, because it generates a non-convex graph that is harder for gradient descent.
If this were the case, then we would be happy with the following situation:
Moreover, I think that many error functions have the property of generating a non-convex graph. However, I think this is not a problem as long as we get the results we need.
I am not sure I understand your argument entirely, but let me give it a shot.
I feel you are mixing two different things:
1. Linear Regression vs. Logistic Regression
2. Which loss function to use
Linear regression will find it difficult to define the boundary required to correctly classify scenarios similar to the one you have shown above. Hence, we use Logistic Regression; see the sketch below.
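To make this concrete, here is a minimal sketch of the failure mode, using hypothetical 1-D data that I made up (np.polyfit is just a convenient way to do the least-squares fit): a single large-x positive example drags the regression line enough to move the 0.5 threshold and misclassify a point that a sensible boundary would get right.

```python
import numpy as np

# Hypothetical 1-D classification data with 0/1 labels.
# The point at x = 20 is a perfectly valid positive example,
# yet it drags the least-squares line down over the boundary region.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 20.0])
y = np.array([0,   0,   0,   1,   1,   1,   1.0])

w, b = np.polyfit(x, y, 1)            # fit y ~ w*x + b by least squares
threshold = (0.5 - b) / w             # x where the fitted line crosses 0.5
print(f"decision threshold at x = {threshold:.2f}")   # ~4.3, not ~3.5
print((w * x + b > 0.5).astype(int))  # x = 4 (a true 1) is now predicted 0
```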
Now that we have decided to use Logistic Regression to more correctly identify such boundaries, the question is which loss function to use. The logistic regression equation fed into the conventional squared error loss function will not yield a convex graph, and hence we opt for the logistic loss function, which will ensure a convex graph. We are well aware of the importance of a convex graph for gradient descent to converge to the global minimum.
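For concreteness, the logistic loss being referred to is the cross-entropy cost from the course: for a single feature, J(w,b) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log a^{(i)} + (1-y^{(i)})\log(1-a^{(i)})\right] with a^{(i)} = \sigma(wx^{(i)}+b), and this J is convex in (w,b).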
At first I thought the squared cost function was not a definition that would lead us to the result we want. After a bit of thinking, I came to the conclusion that it eventually will.
I have a big question here. It is regarding my last paragraph:
Moreover, I think that many error functions have the property of generating a non-convex graph. However, I think this is not a problem as long as we get the results we need.
I am aware that it is much easier to come up with an error function that would eventually generate a convex graph. But is it the case that we come up with such “perfect” graphs for every algorithm? And what if we do not?
At the end of the day, it is Gradient descent that gets us to the optimal values of W, b such that the Cost (J) is at a minimum.
The convex graph becomes important because we take small steps in every iteration of training to reach that minimum-cost point. If the Cost vs. (W,b) graph is not convex, then it could have local minima in addition to the global minimum. Our aim is to find the global minimum, not to get stuck at a local minimum and be fooled into thinking we have reached the global one; if the graph is not convex, there is a real danger of this happening. The outcome: we would be left holding a sub-optimal model (W,b), one whose prediction accuracy could have been made even better if only we had found the global minimum during training.
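To illustrate that danger, here is a minimal sketch with a made-up 1-D cost (not one from the course), where the identical gradient descent procedure lands in different minima depending only on initialization:

```python
# Toy non-convex cost: J(w) = w^4 - 4w^2 + w has a global minimum
# near w ~ -1.47 and a shallower local minimum near w ~ +1.35.
def J(w):
    return w**4 - 4*w**2 + w

def dJ_dw(w):
    return 4*w**3 - 8*w + 1

def gradient_descent(w0, lr=0.01, steps=2000):
    w = w0
    for _ in range(steps):
        w -= lr * dJ_dw(w)
    return w

# Same algorithm, same learning rate; only the starting point differs.
w_a = gradient_descent(w0=-2.0)
w_b = gradient_descent(w0=+2.0)
print(f"start -2.0 -> w = {w_a:.2f}, J = {J(w_a):.2f}")  # global minimum, J ~ -5.44
print(f"start +2.0 -> w = {w_b:.2f}, J = {J(w_b):.2f}")  # local minimum,  J ~ -2.62
```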
@popaqy, @shanup, I would like to contribute another angle to this discussion.
Is a good loss function enough to get a convex parameter space?
No, we need both the right loss function and the right model. A convex parameter space is actually quite rare.
Convex example: linear model + squared loss
With J=(y-a)^2 and a=wx, \frac{\partial^2 J}{\partial w^2} = 2x^2 \ge 0, and thus it is convex.
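If you want to verify this mechanically rather than by hand, a quick symbolic check (using sympy, which is just my tool choice here) gives the same result:

```python
import sympy as sp

w, x, y = sp.symbols('w x y', real=True)
J = (y - w * x)**2                     # squared loss on the linear model a = w*x
print(sp.simplify(sp.diff(J, w, 2)))   # -> 2*x**2, which is >= 0 for all x
```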
Non-convex example: 1 node + sigmoid activation + 1 node + squared loss
This is a 2-layer NN with 1 node in each layer.
With J=(y-a_2)^2, a_2=w_2g(w_1x) and g(w_1x) = \frac{1}{1+\exp{(-w_1x)}}, we get \frac{\partial^2 J}{\partial w_1^2} = -2x^2w_2\,g(1-g)\,[y-2(y+w_2)g+3w_2g^2]
It is not convex: for example, with y=1 and w_2=1 the bracket factors as (3g-1)(g-1), which is positive whenever g<\frac{1}{3}, so \frac{\partial^2 J}{\partial w_1^2} < 0 there.
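The same kind of symbolic check (again with sympy; the probe point below is just one convenient choice) confirms the sign flip:

```python
import sympy as sp

w1, w2, x, y = sp.symbols('w1 w2 x y', real=True)
g = 1 / (1 + sp.exp(-w1 * x))   # first-layer sigmoid
J = (y - w2 * g)**2             # squared loss on a2 = w2 * g
d2J = sp.diff(J, w1, 2)

# At y = 1, w2 = 1, x = 1, w1 = -3 we have g = sigmoid(-3) ~ 0.047 < 1/3,
# and the second derivative is negative, so J is not convex in w1.
print(float(d2J.subs({y: 1, w2: 1, x: 1, w1: -3})))  # ~ -0.074
```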
When the model is non-linear, the property of being convex can’t be guaranteed.
Another point of discussion: a convex space is certainly great, but when we can't have it, due to the nonlinear and complex nature of our model, should we still fight to get to the global minimum? I don't have an answer to this, but I would ask myself: when my NN is big, is getting to the global minimum overfitting to the training data?
You bring up an important point, but you have indeed opened Pandora's box by bringing in the multi-layer case, wherein the quest for the global minimum could lead to the very thing that we are trying to avoid - overfitting.
Considering that the question is from Course 1, I was hoping to keep it confined to basic functions such as a linear model, or even a simple non-linear model such as logistic regression, where finding the global minimum would ensure a better model, and not yet open up the discussion of multi-layer or deep networks, where this might not be the case.