Hello everybody,
I would like your help with a question.
When the professor writes dZ, dW and db, do they refer to derivatives of the cost function (J) or of the loss function (L)?
I ask because when I took the partial derivative of J with respect to Z, the result was (1/m)*(A - Y), which differs from dZ = A - Y.
I appreciate your help.
I think the difference is that “L” is for one example, and “J” is the average over all of the examples.
Since you want to minimize the cost, in practice we use J. The only difference is the constant 1/m factor.
The difference is exactly what you said,
What I want to know is why dZ is A - Y in the video. Maybe it refers to the derivative of the loss function.
The videos are not mathematically perfect or consistent. I would not be overly concerned about it.
Derivative of J with respect to A: dJ/dA = (1/m) * (-y/A + (1-y)/(1-A))
Derivative of J with respect to Z: dJ/dZ = (dA/dZ) * (dJ/dA)
= A*(1-A) * (1/m) * (-y/A + (1-y)/(1-A))
= (1/m) * (A - y)
Note: This is for the sigmoid function, A = sigmoid(Z), where dA/dZ = A*(1-A).
I forgot to put the (1/m) into the last line at first; it belongs there as shown.
The note should also give the cost function:
J = (1/m) * sum(-Y*log(A) - (1-Y)*log(1-A))
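If it helps, the (1/m) relationship can be checked numerically. This is only a rough sketch in plain Python: the values in `zs` and `ys` are made up, and `numeric_dJ_dz` is a hypothetical helper (a central-difference estimate), not anything from the course code. It compares the per-example A - Y against m times the numerical derivative of J.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(z, y):
    """Per-example cross-entropy loss L for a sigmoid output."""
    a = sigmoid(z)
    return -y * math.log(a) - (1 - y) * math.log(1 - a)

def cost(zs, ys):
    """Cost J: the average of the per-example losses."""
    return sum(loss(z, y) for z, y in zip(zs, ys)) / len(zs)

def numeric_dJ_dz(zs, ys, i, eps=1e-6):
    """Central-difference estimate of dJ/dz_i (hypothetical helper)."""
    up = list(zs)
    up[i] += eps
    down = list(zs)
    down[i] -= eps
    return (cost(up, ys) - cost(down, ys)) / (2 * eps)

zs = [0.3, -1.2, 2.0, 0.0]  # made-up pre-activations
ys = [1, 0, 1, 0]           # made-up labels
m = len(zs)

for i, (z, y) in enumerate(zip(zs, ys)):
    dL_dz = sigmoid(z) - y            # the "dZ = A - Y" form, no 1/m
    dJ_dz = numeric_dJ_dz(zs, ys, i)  # derivative of the averaged cost
    print(f"i={i}  A-Y={dL_dz:+.6f}  m*dJ/dz={m * dJ_dz:+.6f}")
```

The two printed columns should agree to several decimal places, which is consistent with dZ = A - Y being the derivative of L and (1/m)*(A - Y) being the derivative of J.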
Yes, dZ is the derivative of L. As Tom says, Prof Ng is not consistent in his use of the “d” prefix to indicate a gradient. But here’s the way to tell:
The only cases in which the gradients are derivatives of J are the dW and db values. Those are the gradients we actually apply to update the parameters W^{[l]} and b^{[l]}.
All other gradients we see in the formulas are just Chain Rule factors that are used to calculate dW and db, so they are usually derivatives of L or some other intermediate value. The other way you can tell is by shape: if the gradient has one column per training example (a dimension of size m), as dZ and dA do, you’re looking at a derivative of L.
This has been discussed many times before, e.g. here and here and here.
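To make the dZ versus dW/db distinction concrete, here is a minimal sketch for plain logistic regression (one layer), written with Python lists rather than the course's NumPy code. The data, the starting `w` and `b`, and the helper names `forward`, `cost` and `gradients` are all assumptions for illustration. Note where the 1/m appears: not in dZ, only when dZ is averaged into dW and db.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, X):
    """Activations A, one per example; X is a list of feature lists."""
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) for x in X]

def cost(w, b, X, Y):
    """Cost J: average cross-entropy loss over the examples."""
    A = forward(w, b, X)
    total = sum(-y * math.log(a) - (1 - y) * math.log(1 - a) for a, y in zip(A, Y))
    return total / len(Y)

def gradients(w, b, X, Y):
    m = len(Y)
    A = forward(w, b, X)
    dZ = [a - y for a, y in zip(A, Y)]  # per-example dL/dz, no 1/m
    # Averaging over the examples is what turns these into derivatives of J.
    dW = [sum(dz * x[j] for dz, x in zip(dZ, X)) / m for j in range(len(w))]
    db = sum(dZ) / m
    return dZ, dW, db

X = [[0.5, -1.0], [1.5, 2.0], [-0.3, 0.8]]  # made-up features, 3 examples
Y = [1, 0, 1]                               # made-up labels
w, b = [0.1, -0.2], 0.05                    # made-up parameters

dZ, dW, db = gradients(w, b, X, Y)

# Central-difference estimates of dJ/dw_0 and dJ/db for comparison.
eps = 1e-6
num_dW0 = (cost([w[0] + eps, w[1]], b, X, Y) - cost([w[0] - eps, w[1]], b, X, Y)) / (2 * eps)
num_db = (cost(w, b + eps, X, Y) - cost(w, b - eps, X, Y)) / (2 * eps)

print("len(dZ) =", len(dZ), "(one entry per example)")
print("dW[0] =", dW[0], " numeric dJ/dw_0 =", num_dW0)
print("db    =", db, " numeric dJ/db   =", num_db)
```

dZ has one entry per example, while dW has the same shape as w and matches the numerical derivative of J, which is the shape-based rule of thumb described above.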