# Derivative of ReLU in output layer

Yes, that’s basically right. The only caveat is that you have to be a little more precise about matching the way that Prof Ng expresses things. There is the “loss” function L, which gives a vector with one loss value per sample. Then there is the “cost” function J, which is the average of the loss values across the samples in the training set. The way Prof Ng decomposes things, he only uses J at the very final step, where he computes the gradients of the weight and bias values. Everywhere else, he is computing “Chain Rule” factors. Notice, for the third time, what the notation dAL means: it is the derivative of L, not of J, so you don’t have the summation and you don’t have the factor of $\frac{1}{m}$.

$$L(Y, A^{[L]}) = \displaystyle \frac{1}{2} (A^{[L]} - Y)^2$$

$$\displaystyle \frac{\partial L}{\partial A^{[L]}} = (A^{[L]} - Y)$$

If you wanted to compute the partial derivative of J, it would be:

$$\displaystyle \frac{\partial J}{\partial A^{[L]}} = \frac{1}{m} \sum_{i = 1}^m (A_i^{[L]} - Y_i)$$

But that is not really what we need to plug into the way Prof Ng has structured all the layers of functions here.
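To make the distinction concrete, here is a minimal NumPy sketch of a single ReLU output layer with the squared-error loss above. The function names (`relu`, `cost`, `output_layer_grads`), the shapes, and the random data are all assumptions for illustration and are not the course's notebook code. It shows that `dAL` is a per-sample quantity with no sum and no $\frac{1}{m}$, and that the averaging only appears when `dW` and `db` are formed. A finite-difference check against J is included as a sanity test.

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def cost(W, b, A_prev, Y):
    """J: the average over the m samples of the per-sample loss 1/2 (A - Y)^2."""
    AL = relu(W @ A_prev + b)
    return np.mean(0.5 * (AL - Y) ** 2)

def output_layer_grads(W, b, A_prev, Y):
    """Hypothetical helper: gradients for one ReLU output layer."""
    m = Y.shape[1]
    Z = W @ A_prev + b
    AL = relu(Z)
    dAL = AL - Y                                # derivative of L: one value per sample
    dZ = dAL * (Z > 0)                          # Chain Rule factor through ReLU
    dW = (dZ @ A_prev.T) / m                    # the sum over samples and 1/m appear here
    db = np.sum(dZ, axis=1, keepdims=True) / m  # and here
    return dAL, dW, db

# Made-up data: 3 input features, 1 output unit, m = 5 samples.
rng = np.random.default_rng(0)
A_prev = rng.normal(size=(3, 5))
Y = rng.normal(size=(1, 5))
W = rng.normal(size=(1, 3))
b = np.zeros((1, 1))

dAL, dW, db = output_layer_grads(W, b, A_prev, Y)

# Central finite differences on J, to compare against the analytic dW.
eps = 1e-6
dW_num = np.zeros_like(W)
for j in range(W.shape[1]):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[0, j] += eps
    W_minus[0, j] -= eps
    dW_num[0, j] = (cost(W_plus, b, A_prev, Y) - cost(W_minus, b, A_prev, Y)) / (2 * eps)

print("dAL shape:", dAL.shape)
print("max |dW - dW_num|:", np.max(np.abs(dW - dW_num)))
```

If you put the $\frac{1}{m}$ into `dAL` as well, the factor would be applied twice by the time you reach `dW`, and the finite-difference check would no longer agree.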

I am taking advantage of the feature of formatting LaTeX expressions here on Discourse. That was explained on the DLS FAQ Thread. Of course that assumes you are familiar with LaTeX, which is a language for formatting mathematical expressions that Leslie Lamport built on top of Prof Donald Knuth’s TeX. If that is new to you, just google “LaTeX” and you’ll find plenty of useful info.
