Flat points: are those completely flat?

This is from the chapter on choosing the activation function.
The minimum point of the linear regression cost function was completely flat, so the derivative of J was 0 at that point.
But I think the points in the image are not fully flat. If they were, the gradient descent algorithm would stop updating once the derivative of J reached 0, but Prof. Andrew marks those points and indicates that gradient descent keeps running.

Please correct me if you find anything wrong with my concept.

It’s just a sketch. A real plot of cost vs w would have a continual (though varying) slope down to the minimum cost.
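
To make the "nearly flat, not exactly flat" point concrete, here is a minimal sketch. It assumes a hypothetical one-parameter model f(x) = sigmoid(w * x) with a squared-error cost on a single example (x = 1, y = 1); none of this is taken from the slide. In the flat-looking region the slope is tiny but nonzero, so gradient descent keeps updating w, only slowly.

```python
import math


def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def cost_slope(w, x=1.0, y=1.0):
    """dJ/dw for the illustrative cost J(w) = 0.5 * (sigmoid(w * x) - y) ** 2."""
    s = sigmoid(w * x)
    return (s - y) * s * (1.0 - s) * x


def gradient_descent(w, alpha=1.0, steps=5):
    """Run a few plain gradient descent updates and return the visited w values."""
    history = [w]
    for _ in range(steps):
        w = w - alpha * cost_slope(w)
        history.append(w)
    return history


flat_slope = cost_slope(-6.0)   # far out on the flat-looking part of the curve
steep_slope = cost_slope(0.0)   # middle of the curve, where the slope is larger
path = gradient_descent(-6.0)   # w still moves on every step, just by small amounts
```

Printing `flat_slope` next to `steep_slope` shows how much the step size shrinks on the flat-looking part, even though the updates never stop there.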

Does the graph represent the MSE cost function or the logistic loss cost function?

For the subject of minimization, it doesn’t really matter. The curve shown could be either for classification or regression. Both will have similar shapes.

This is discussed starting at 0:52 in that video.

If the output activation is sigmoid, then it’s the logistic cost function.
If the output activation is linear, then it’s MSE.
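
As a rough sketch of the two cases, the functions below pair a sigmoid output with the logistic (binary cross-entropy) cost and a linear output with MSE. The function names and the 1/(2m) scaling for MSE are assumptions chosen for illustration; conventions vary.

```python
import math


def logistic_cost(predictions, targets):
    """Average binary cross-entropy; predictions are sigmoid outputs in (0, 1)."""
    total = 0.0
    for f, y in zip(predictions, targets):
        total += -y * math.log(f) - (1 - y) * math.log(1 - f)
    return total / len(targets)


def mse_cost(predictions, targets):
    """Squared-error cost with 1/(2m) scaling (an assumed convention)."""
    total = sum((f - y) ** 2 for f, y in zip(predictions, targets))
    return total / (2 * len(targets))
```

Either function returns a single number that gradient descent tries to reduce, which is why the minimization discussion applies to both.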

Please clear my confusion.

In this slide, does the cost function graph for the sigmoid represent the cost function built from the logarithmic loss function L(f(x), y)?

(From the chapter on choosing the activation function.)

It isn’t specified, because it doesn’t matter. Both types of cost functions will have a similar shape.

The logarithmic loss ensures the cost function is convex, and therefore ensures convergence to the global minimum. But the graph isn’t showing a global minimum :thinking:

In a neural network that has a hidden layer, you don’t necessarily get a global minimum.

This is because the cost function of a neural network with hidden layers is not convex.

I tried to picture why it is not convex.
If there are 2 neurons in layer 1 and 3 inputs (x0, x1, x2) in layer 0, then the weight vectors w0 and w1 of the 2 neurons will each be 1x3. So the cost function at the output of layer 1 will depend on a 3x2 weight matrix (one column each for w0 and w1, with 3 rows per column). That makes it complicated, which makes the cost function non-convex.

Let me know if I am still wrong. :stuck_out_tongue:

The issue is not the number of units in each layer.

What makes the NN cost function not convex is that the hidden layer has a non-linear function (such as ReLU, sigmoid, or tanh).

Hi Tom,
Is this the correct explanation?

#Nonconvexity #NeuralNetwork

I don’t agree with that description, because their example (multiplying parameters) is not how a NN works.

Oh :frowning:! I thought it was correct, since I found that when the final layer’s activation function is expanded, the final layer’s w is multiplied by the w of the previous layers.

No, it’s the sum of the products of each layer’s (weight * activation).

So it’s more like (w1 * a1) + (w2 * a2)

(w1 * w2) never appears there.

A real proof of the NN cost function not being convex requires quite a bit of calculus.

What if we expand a1 and a2? I actually did that, and found a multiplication between two w’s.

Yes, that is true. My previous reply was in error.
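
A small sketch of that expansion, under assumptions that are not from the thread: a hypothetical network with one input, one tanh hidden unit with weight w1, a linear output with weight w2, no biases, and a squared-error cost on a single example. Expanding the hidden activation gives f(x) = w2 * tanh(w1 * x), so w2 multiplies a function of w1. Because tanh is odd, (w1, w2) = (1, 1) and (-1, -1) give the same prediction, yet the point midway between them has a higher cost, which a convex function does not allow.

```python
import math


def predict(w1, w2, x):
    """Toy network: tanh hidden unit followed by a linear output, no biases."""
    a1 = math.tanh(w1 * x)   # hidden activation, depends on w1
    return w2 * a1           # output weight multiplies a function of w1


def cost(w1, w2, x=1.0, y=math.tanh(1.0)):
    """Squared-error cost on one illustrative example (x, y)."""
    return 0.5 * (predict(w1, w2, x) - y) ** 2


cost_a = cost(1.0, 1.0)      # fits the example
cost_b = cost(-1.0, -1.0)    # fits it equally well, by the symmetry of tanh
cost_mid = cost(0.0, 0.0)    # midpoint of the two weight settings

# Convexity would require cost_mid <= average of the two endpoint costs.
violates_convexity = cost_mid > 0.5 * (cost_a + cost_b)
```

This is only a counterexample for this toy network, not a general proof.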

Thank you, Tom. I wanted to hear that.

Thanks for pointing out my mistake. I appreciate the opportunity to improve.

It happens, and it gives me hope that experts can make minor errors too :slight_smile: