This is from the "Choosing the activation function" chapter.
The minimum point of the cost function of linear regression was completely flat, so the derivative of J was 0 at that point.
But I think the points in the image are not fully flat. If they were, gradient descent would stop once the derivative of J reached 0, but Prof. Andrew marks those points as if gradient descent keeps running.
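For reference, my reasoning is based on the usual gradient descent update (written from memory, so the notation is my own rather than exactly the slide's):

```latex
w := w - \alpha \frac{\partial J(w,b)}{\partial w}, \qquad
b := b - \alpha \frac{\partial J(w,b)}{\partial b}
```

If the derivative is exactly 0 at a point, the update leaves w and b unchanged, which is why I would expect gradient descent to stop there.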
Please correct me if there is anything wrong with my understanding.
For the purposes of minimization, it doesn't really matter. The curve shown could be for either classification or regression; both have similar shapes.
This is discussed starting at 0:52 in that video.
If the output activation is sigmoid, then it’s the logistic cost function.
If the output activation is linear, then it’s MSE.
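For concreteness, here are the two cost functions in the course's usual notation (written from memory, so treat the exact symbols as my paraphrase):

```latex
J_{\text{logistic}}(\vec{w},b) = -\frac{1}{m}\sum_{i=1}^{m}
  \left[\, y^{(i)}\log f(\vec{x}^{(i)}) + \left(1-y^{(i)}\right)\log\!\left(1-f(\vec{x}^{(i)})\right) \right]

J_{\text{MSE}}(\vec{w},b) = \frac{1}{2m}\sum_{i=1}^{m}\left(f(\vec{x}^{(i)}) - y^{(i)}\right)^{2}
```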
The logarithm in the cost function ensures that it is convex and therefore ensures convergence to the global minimum. But the graph isn't representing the global minimum.
I'm trying to picture this in my head; why is it not convex?
If there are 2 neurons in layer 1 and 3 inputs (x0, x1, x2) in layer 0, then each of the two neurons' weight vectors w0 and w1 is 1x3. So the cost function of the output of layer 1 depends on a 3x2 matrix of weights (one column per neuron, with the 3 rows coming from w0 and w1). That makes it more complicated and makes the cost function non-convex; see the sketch below.
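Here is a small toy sketch of that idea (my own example, not from the course): 3 inputs, 2 hidden sigmoid units, 1 sigmoid output, trained with the logistic cost. Swapping the two hidden units (together with the matching output weights) gives a different point in weight space with exactly the same cost, which is the usual intuition for why the cost surface of a network with a hidden layer is not convex.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W1, b1, W2, b2, X, y):
    # Forward pass: hidden layer, then output layer, then the logistic cost.
    A1 = sigmoid(X @ W1.T + b1)           # shape (m, 2): one activation per hidden unit
    a2 = sigmoid(A1 @ W2.T + b2).ravel()  # shape (m,): sigmoid output
    return -np.mean(y * np.log(a2) + (1 - y) * np.log(1 - a2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))               # 5 examples, 3 features (x0, x1, x2)
y = np.array([0, 1, 1, 0, 1])

W1 = rng.normal(size=(2, 3))              # one 1x3 weight row per hidden unit
b1 = rng.normal(size=2)
W2 = rng.normal(size=(1, 2))              # output weights, one per hidden unit
b2 = rng.normal(size=1)

# Swap the two hidden units (rows of W1, entries of b1) and, to keep the
# overall network function identical, swap the matching columns of W2 too.
perm = [1, 0]
print(cost(W1, b1, W2, b2, X, y))
print(cost(W1[perm], b1[perm], W2[:, perm], b2, X, y))   # same cost, different weights
```

Both printed values come out identical even though the weight matrices differ, so the cost surface has multiple equally good configurations instead of a single bowl.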
Oh! I thought it was correct, since I found that when you expand the final layer's activation function, the expression contains the final layer's w multiplied by the w of the previous layers.
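What I had in mind is roughly this (using a linear activation in layer 1 just to keep the algebra simple; this is my own simplification, not the exact course derivation):

```latex
a^{[2]} = g\!\left(W^{[2]} a^{[1]} + b^{[2]}\right)
        = g\!\left(W^{[2]}\left(W^{[1]}\vec{x} + b^{[1]}\right) + b^{[2]}\right)
        = g\!\left(W^{[2]}W^{[1]}\vec{x} + W^{[2]}b^{[1]} + b^{[2]}\right)
```

Because the cost then depends on the product W^{[2]}W^{[1]}, it is not convex jointly in the two weight matrices (even a simple product like w1*w2 is non-convex), which is what I was trying to say.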