# Flat points: are they completely flat?

This is from the "Choosing the activation function" chapter.
The minimum point of the cost function for linear regression was completely flat, so the derivative of J was 0 at that point.
But I think the points in the image are not fully flat. If they were, gradient descent would stop running once it reached dJ/dw = 0, yet Prof. Andrew marks those points while indicating that gradient descent keeps running.

Please correct me if you find anything wrong with my understanding.

It's just a sketch. A real plot of cost vs. w would have a continual (though varying) slope down to the minimum cost.
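As a quick illustration, here is a minimal sketch using a made-up quadratic cost J(w) = (w - 3)^2 (my own toy example, not the course's plot): the slope shrinks smoothly as w approaches the minimum, but it is exactly zero only at the minimum itself, so gradient descent keeps taking ever-smaller steps rather than stopping early.

```python
# A made-up 1-D cost for illustration: J(w) = (w - 3)^2, minimum at w = 3.
def J(w):
    return (w - 3.0) ** 2

def dJ_dw(w):
    return 2.0 * (w - 3.0)

w = 0.0        # initial guess
alpha = 0.1    # learning rate
for step in range(25):
    grad = dJ_dw(w)
    w = w - alpha * grad
    if step % 5 == 0:
        print(f"step {step:2d}: w = {w:.5f}, dJ/dw was {grad:.6f}")

# The slope keeps shrinking as w approaches 3, but it is only exactly
# zero at the minimum itself -- the curve is not flat anywhere else.
```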


Does the graph represent the MSE cost function or the logistic loss cost function?

For the subject of minimization, it doesn't really matter. The curve shown could be either for classification or regression. Both will have similar shapes.

This is discussed starting at 0:52 in that video.

If the output activation is sigmoid, then it's the logistic cost function.
If the output activation is linear, then it's MSE.
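For reference, these are the two costs as they are usually written in the course, with f(x) denoting the model's output and m the number of training examples (treat the exact scaling constants as course conventions):

```latex
% MSE cost (linear output activation, regression):
J(\vec{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f(\vec{x}^{(i)}) - y^{(i)} \right)^2

% Logistic (log) cost (sigmoid output activation, classification):
J(\vec{w}, b) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log f(\vec{x}^{(i)})
              + \big(1 - y^{(i)}\big) \log\big(1 - f(\vec{x}^{(i)})\big) \Big]
```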

Does the cost function graph for the sigmoid represent the cost built from the logarithmic loss function L(f(x), y) shown in this slide?

> choosing the activation function chapter.

It isn't specified, because it doesn't matter. Both types of cost functions will have a similar shape.

The logarithmic loss makes the cost function convex, which guarantees convergence to the global minimum. But the graph doesn't appear to represent a global minimum.

In a neural network that has a hidden layer, you don't necessarily get a global minimum.


This is because the cost function of a neural network with hidden layers is not convex.


I'm trying to picture why it is not convex.
If there are 2 neurons in layer 1 and 3 inputs (x0, x1, x2) in layer 0, then the weight vectors w0 and w1 of the 2 neurons are each 1x3. So the cost function of layer 1's output depends on a 3x2 weight matrix (one column per neuron, one row per input). That seems complicated enough to make the cost function non-convex.

Let me know if I am still wrong.

The issue is not the number of units in each layer.

What makes the NN cost function non-convex is that the hidden layer applies a non-linear activation function (such as ReLU, sigmoid, or tanh).
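One way to see this without calculus: swapping two hidden units leaves the network's function unchanged, and that symmetry already breaks convexity. Below is a minimal NumPy sketch (a made-up one-input network with 2 tanh hidden units and toy data, not the course's own example) showing a direct violation of the convexity inequality:

```python
import numpy as np

# Toy network: 1 input -> 2 tanh hidden units -> linear output (no biases).
def predict(w1, w2, x):
    a = np.tanh(np.outer(x, w1))   # hidden activations, shape (m, 2)
    return a @ w2                  # linear output layer

def cost(w1, w2, x, y):
    return np.mean((predict(w1, w2, x) - y) ** 2) / 2

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# Pick arbitrary weights theta, and build targets the network fits exactly.
w1 = np.array([1.0, -2.0]); w2 = np.array([1.5, 0.5])
y = predict(w1, w2, x)                 # so cost(theta) == 0

# Swapping the two hidden units gives a different point theta' in
# parameter space that computes exactly the same function.
w1s, w2s = w1[::-1], w2[::-1]

# Midpoint of theta and theta' in parameter space:
w1m, w2m = (w1 + w1s) / 2, (w2 + w2s) / 2

J, Js, Jm = cost(w1, w2, x, y), cost(w1s, w2s, x, y), cost(w1m, w2m, x, y)
print(J, Js, Jm)   # J == Js == 0, but Jm > 0

# A convex function must satisfy J(midpoint) <= (J + Js) / 2.
# Here the midpoint cost is strictly larger, so J is not convex.
```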

Hi Tom,
Is this the correct explanation?

(Link to an external article, tagged #Nonconvexity #NeuralNetwork.)

I don't agree with that description, because their example (multiplying parameters) is not how a NN works.

Oh! I thought it was correct, since I found that expanding the final layer's activation involves multiplying the final layer's w by the w's of the previous layers.

No, it's the sum of the products of each layer's (weight * activation).

So it's more like (w1 * a1) + (w2 * a2).

(w1 * w2) never appears there.

A real proof of the NN cost function not being convex requires quite a bit of calculus.


What if we break down a1 and a2? I actually did that, and I found a multiplication between two w's.

Yes, that is true. My previous reply was in error.
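To make that algebra concrete, here is a small SymPy sketch of a two-layer network with one unit per layer (a simplification I'm using just for illustration):

```python
import sympy as sp

x, w1, w2 = sp.symbols('x w1 w2')

# Two-layer network with a LINEAR hidden activation, just to make
# the algebra visible:
a1 = w1 * x          # hidden layer output
f = w2 * a1          # output layer

print(sp.expand(f))  # -> w1*w2*x : the two layers' weights multiply

# With a nonlinear activation g the product is composed rather than
# literal: f = w2 * g(w1 * x). w1 still ends up inside g and w2
# outside, so the cost depends on the weights jointly, not separately.
g = sp.Function('g')
f_nonlinear = w2 * g(w1 * x)
print(f_nonlinear)   # -> w2*g(w1*x)
```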


Thank you, Tom. That's what I wanted to hear.

Thanks for pointing out my mistake. I appreciate the opportunity to improve.


It happens, and it gives me hope that experts can make minor errors too.