It is mentioned that we do not use the error cost function the way we defined it in linear regression, because it generates a non-convex graph that is harder for gradient descent.
If this were the case, then we would be happy with the following situation:
Moreover, I think that many error functions have the property of generating a non-convex graph. However, I think this is not a problem as long as we get the results we need.
I am not sure I understand your argument entirely, but let me give it a shot.
I feel you are mixing two different things:
1. Linear Regression vs. Logistic Regression
2. Which loss function to use
Linear regression will find it difficult to define the boundary required to correctly classify scenarios similar to the one you have shown above. Hence, we use Logistic Regression; see the sketch below.
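To make this concrete, here is a minimal sketch of the failure mode, using hypothetical 1-D data that I made up (np.polyfit is just a convenient way to do the least-squares fit): a single large-x positive example drags the regression line enough to move the 0.5 threshold and misclassify a point that a sensible boundary would get right.

```python
import numpy as np

# Hypothetical 1-D classification data with 0/1 labels.
# The point at x = 20 is a perfectly valid positive example,
# yet it drags the least-squares line down over the boundary region.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 20.0])
y = np.array([0,   0,   0,   1,   1,   1,   1.0])

w, b = np.polyfit(x, y, 1)            # fit y ~ w*x + b by least squares
threshold = (0.5 - b) / w             # x where the fitted line crosses 0.5
print(f"decision threshold at x = {threshold:.2f}")   # ~4.3, not ~3.5
print((w * x + b > 0.5).astype(int))  # x = 4 (a true 1) is now predicted 0
```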
Now that we have decided to use Logistic Regression to more correctly identify such boundaries, the question is which loss function to use. The logistic regression equation fed into the conventional squared error loss function will not yield a convex graph, and hence we opt for the logistic loss function, which will ensure a convex graph. We are well aware of the importance of a convex graph for gradient descent to converge to the global minimum.
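For concreteness, the logistic loss being referred to is the cross-entropy cost from the course: for a single feature, J(w,b) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log a^{(i)} + (1-y^{(i)})\log(1-a^{(i)})\right] with a^{(i)} = \sigma(wx^{(i)}+b), and this J is convex in (w,b).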
At first I thought the squared cost function was not a definition that would lead us to the result we want. After a bit of thinking, I came to the conclusion that it eventually will.
I have a big question here. It is regarding my last paragraph:
Moreover, I think that many error functions have the property of generating a non-convex graph. However, I think this is not a problem as long as we get the results we need.
I am aware that it is much easier to come up with an error function that would eventually generate a convex graph. But is it the case that we come up with such “perfect” graphs for every algorithm? And what if we do not?
At the end of the day, it is Gradient descent that gets us to the optimal values of W, b such that the Cost (J) is at a minimum.
The convex graph becomes important because we take small steps in every iteration of training to reach that minimum-cost point. If the Cost vs. (W,b) graph is not convex, then it could have local minima in addition to the global minimum. Our aim is to find the global minimum, not to get stuck at a local minimum and be fooled into thinking we have reached the global one; if the graph is not convex, there is a real danger of this happening. The outcome: we would be left holding a sub-optimal model (W,b), one whose prediction accuracy could have been made even better if only we had found the global minimum during training.
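To illustrate that danger, here is a minimal sketch with a made-up 1-D cost (not one from the course), where the identical gradient descent procedure lands in different minima depending only on initialization:

```python
# Toy non-convex cost: J(w) = w^4 - 4w^2 + w has a global minimum
# near w ~ -1.47 and a shallower local minimum near w ~ +1.35.
def J(w):
    return w**4 - 4*w**2 + w

def dJ_dw(w):
    return 4*w**3 - 8*w + 1

def gradient_descent(w0, lr=0.01, steps=2000):
    w = w0
    for _ in range(steps):
        w -= lr * dJ_dw(w)
    return w

# Same algorithm, same learning rate; only the starting point differs.
w_a = gradient_descent(w0=-2.0)
w_b = gradient_descent(w0=+2.0)
print(f"start -2.0 -> w = {w_a:.2f}, J = {J(w_a):.2f}")  # global minimum, J ~ -5.44
print(f"start +2.0 -> w = {w_b:.2f}, J = {J(w_b):.2f}")  # local minimum,  J ~ -2.62
```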
@popaqy, @shanup, I would like to contribute another angle to this discussion.
Is a good loss function enough to get a convex parameter space?
No, we need both the right loss function and the right model. A convex parameter space is actually quite rare.
Convex example: linear model + squared loss
With J=(y-a)^2 and a=wx, \frac{\partial^2 J}{\partial w^2} = 2x^2 \ge 0, and thus it is convex.
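If you want to verify this mechanically rather than by hand, a quick symbolic check (using sympy, which is just my tool choice here) gives the same result:

```python
import sympy as sp

w, x, y = sp.symbols('w x y', real=True)
J = (y - w * x)**2                     # squared loss on the linear model a = w*x
print(sp.simplify(sp.diff(J, w, 2)))   # -> 2*x**2, which is >= 0 for all x
```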
Non-convex example: 1 node + sigmoid activation + 1 node + squared loss
This is a 2-layer NN with 1 node in each layer.
With J=(y-a_2)^2, a_2=w_2g(w_1x) and g(w_1x) = \frac{1}{1+\exp{(-w_1x)}}, we get \frac{\partial^2 J}{\partial w_1^2} = -2x^2w_2\,g(1-g)\,[y-2(y+w_2)g+3w_2g^2]
It is not convex: for example, with y=1 and w_2=1 the bracket factors as (3g-1)(g-1), which is positive whenever g<\frac{1}{3}, so \frac{\partial^2 J}{\partial w_1^2} < 0 there.
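The same kind of symbolic check (again with sympy; the probe point below is just one convenient choice) confirms the sign flip:

```python
import sympy as sp

w1, w2, x, y = sp.symbols('w1 w2 x y', real=True)
g = 1 / (1 + sp.exp(-w1 * x))   # first-layer sigmoid
J = (y - w2 * g)**2             # squared loss on a2 = w2 * g
d2J = sp.diff(J, w1, 2)

# At y = 1, w2 = 1, x = 1, w1 = -3 we have g = sigmoid(-3) ~ 0.047 < 1/3,
# and the second derivative is negative, so J is not convex in w1.
print(float(d2J.subs({y: 1, w2: 1, x: 1, w1: -3})))  # ~ -0.074
```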
When the model is non-linear, the property of being convex can’t be guaranteed.
Another point of discussion: a convex space is certainly great, but when we can't have it, due to the nonlinear and complex nature of our model, should we still fight to get to the global minimum? I don't have an answer to this, but I would ask myself: when my NN is big, is getting to the global minimum overfitting to the training data?
You bring up an important point, but you have indeed opened Pandora's box by bringing in the multi-layer case, wherein the quest for the global minimum could lead to the very thing that we are trying to avoid - overfitting.
Considering that the question is from Course 1, I was hoping to keep it confined to basic functions such as a linear model, or even a simple non-linear model such as logistic regression, where finding the global minimum would ensure a better model, and not yet open up the discussion of multi-layer or deep networks, where this might not be the case.