Derivative of ReLU in output layer

Hello! I hope you are doing well.
I’ve completed DLS Course 1 and am now practicing with random data. I am wondering what the derivative of A (the output) is if ReLU is the activation function in the last layer. I am practicing with a linear regression problem. However, the course deals with logistic regression, where the derivative for the final layer (sigmoid) is:
dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

But what is the derivative of dAL if we use ReLU for the output layer?
Thanks in advance.

Saif Ur Rehman.


Hi @saifkhanengr
The derivative of the ReLU function is easy to compute. The ReLU function is g(z) = \max(0, z),
so its derivative is g'(z) = 1 if z > 0 and g'(z) = 0 if z < 0.

Please feel free to ask any questions.


The other key thing to realize is that the dAL value that you show is not the derivative of the activation function at the output layer, right? It’s the derivative of the loss function and that was “cross entropy” loss in the case you show, w.r.t. the activation output of the last layer. But if you’re using ReLU as the output layer activation (meaning that your problem is a regression problem, not a classification problem), then cross entropy is not going to work. Take a look at the logs in the cross entropy cost: they assume that the input is a probability value between 0 and 1, in other words the output of sigmoid. That’s not true anymore. If you have a regression problem, you need to be using a distance based loss function like MSE.


Thank you @AbdElRhaman_Fakhry and @paulinpaloalto for your time and for your reply.
OK, I got the derivative term but how to write it in code? Like this for 2-layers:

    if A1 > 0:
        dA2 = 1
    if A1 < 0:
        dA2 = 0

Yes, I am using MSE: cost = np.sum((AL-Y)**2)/(2*m)

Actually, I am using all the functions of DLS Course 1 week 4 assignments but tweaking them for a regression problem. How to implement the derivative of dA2 in the image below?


Why does dA2 depend on A1?

They gave us the code for the derivative of ReLU in Week 4, right? Just look at the code in relu_backward.

You also have to keep track of the meaning of Prof Ng’s notation. That dA2 value is what he calls dAL in the fully general 4 layer case.

dAL = \displaystyle \frac {\partial L}{\partial A^{[L]}}

So what will that be if L is the MSE loss function? Also note that dA2 doesn’t really involve ReLU, it’s purely the loss function: A^{[2]} is the output of ReLU, right? The way you have written the code, the derivative of ReLU will be taken care of for you when linear_activation_backward calls relu_backward, right?
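For reference, here is a minimal sketch of what a ReLU backward step does. It is not the course's exact code: the name relu_backward_sketch is made up, and it assumes the cache is simply the Z array saved during forward propagation.

```python
import numpy as np

def relu_backward_sketch(dA, Z):
    """Chain-rule step through ReLU: dZ = dA * g'(Z), where g'(Z) is 1 for Z > 0 and 0 otherwise."""
    dZ = np.array(dA, copy=True)  # start from the upstream gradient dA
    dZ[Z <= 0] = 0                # ReLU is flat for Z <= 0, so no gradient flows through there
    return dZ

# Made-up values, just to show the masking
Z = np.array([[-1.5, 0.0, 2.0]])
dA = np.array([[0.3, -0.7, 0.4]])
dZ = relu_backward_sketch(dA, Z)
```

Notice that dA is an input here. The ReLU derivative only masks it, which is why dA2 itself has to come from somewhere else, namely the loss.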

It’s great that you are extending this code. You always learn something interesting when you do that. In this case, the first step in the learning is to understand how the existing code works. :nerd_face:


@paulinpaloalto Thank you sir for your detailed comment. I am stupid, didn’t get it yet.

To my understanding, to call linear_activation_backward, first we need the derivative of the last layer’s activation (dAL or dA2 in a 2-layer case). Because linear_activation_backward returns dA_prev, dW, db. I am struggling with this, how to write the dAL or dA2 (in a 2-layer case), regression problem.

I don’t understand this point. To my understanding, derivative terms have nothing to do with MSE. Right, sir? For logistic, the derivative term involves Y and AL but again, this is not a loss function.

relu_backward returns dZ, which we can then use as input to the linear_backward function, which returns dA_prev. But I am struggling to find dAL/dA2 (not dA_prev).

I am so unintelligent, sir. I highly appreciate your time and detailed comment, but I didn’t get it. Can you please make it more simple? I will be extremely thankful to you.

Saif Ur Rehman.


Yes, that’s right, but I think the only confusion is about what dAL or dA2 actually is. It’s not the derivative of the activation, right? A2 is the output of the activation. Remember the definition of dA2, which I stated earlier:

dA2 = \displaystyle \frac {\partial L}{\partial A^{[2]}}

And think back to what that value is in the case that we’re doing a classification problem with cross entropy as the loss function. Remember the formula for dAL in the general case was:

dAL = \displaystyle -\frac {Y}{A^{[L]}} + \frac {1 - Y}{1 - A^{[L]}}

What is that the derivative of? It is this function differentiated w.r.t. A^{[L]}:

L(Y, A^{[L]}) = - Y \log(A^{[L]}) - (1 - Y) \log(1 - A^{[L]})

That’s the vector loss function for cross entropy, right? The derivative of sigmoid is not required to compute that formula for dAL.
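If you want to convince yourself numerically, a finite-difference check is one way to do it. This is only a sketch: the Y and AL values are made up and the function names are hypothetical, not from the assignment.

```python
import numpy as np

def cross_entropy_loss(Y, AL):
    # Per-sample loss vector L(Y, AL); no averaging over the samples
    return -Y * np.log(AL) - (1 - Y) * np.log(1 - AL)

def dAL_cross_entropy(Y, AL):
    # Analytic derivative of that loss w.r.t. AL; sigmoid does not appear anywhere
    return -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

Y = np.array([[1.0, 0.0, 1.0]])
AL = np.array([[0.8, 0.3, 0.6]])  # made-up outputs strictly between 0 and 1

# The loss is elementwise, so shifting every entry by eps gives the elementwise derivative
eps = 1e-6
numeric = (cross_entropy_loss(Y, AL + eps) - cross_entropy_loss(Y, AL - eps)) / (2 * eps)
analytic = dAL_cross_entropy(Y, AL)
max_diff = np.max(np.abs(numeric - analytic))
```

The two results should agree to many decimal places, which confirms that the dAL formula comes from the loss alone.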

Now we’re translating that into your new case where it’s not a classification problem, but a regression problem and the A^{[2]} values are continuous real number outputs with an infinite range of values. So we can’t use cross entropy as the loss anymore and we’ve decided to use MSE as the loss instead.

So how would you apply the idea above to the case of MSE as the loss function? That dA2 term will not involve ReLU, but it will be the derivative of the MSE loss w.r.t. the output A^{[2]}.


Aha. I got that. I didn’t know that dAL is the derivative of the Loss function. Thanks for this point. Everything is now crystal clear to me.

So, MSE is np.sum((AL-Y)**2)/(2*m). Now we need to find its derivative with respect to A.
I think it is:
dAL= np.sum((A2-Y)*A2)/m or np.sum((A2-Y))/m
but I studied calculus some 5 years ago and I’m not sure. Kindly correct me, sir.

One more thing: how to insert an equation in this chat box. You inserted multiple equations which make the chat clear and easy to understand but I cannot find that option.

Once again, thanks a million sir for making my concepts clear. I am highly indebted to you.



I just revisited a few videos of DLS course 1 week 4 and found that it was my fault that I didn’t know that dAL is the derivative of the loss function (not the derivative of AL). Prof. Andrew explains that clearly, but I missed that. However, one thing makes me curious: why is it the derivative of the loss and not the derivative of AL?

Furthermore, we initialize w1 and b1 randomly, and then in the backward pass we determine dw1 and db1 during the first iteration. At the start of the second iteration, do we use dw1 and db1 as the values of w1 and b1? I think YES.


Yes, that’s basically right. The only caveat is that you have to be a little more precise about matching the way that Prof Ng expresses things. There is the “loss” function L which gives a vector value with the loss for each sample. Then there is the “cost” function J which is the average of the loss values across the samples in the training set. The way Prof Ng decomposes things, he only uses J at the very final step where he computes the gradients of the weight and bias values. Everywhere else, he is computing “Chain Rule” factors. Notice again for the third time what the notation dAL means: it is the derivative of L, not of J, so you don’t have the summation and you don’t have the factor of \frac {1}{m}.

L(Y, A^{[L]}) = \displaystyle \frac {1}{2} (A^{[L]} - Y)^2

\displaystyle \frac {\partial L}{\partial A^{[L]}} = (A^{[L]} - Y)

If you wanted to compute the partial derivative of J, it would be:

\displaystyle \frac {\partial J}{\partial A^{[L]}} = \frac {1}{m} \sum_{i = 1}^m (A_i^{[L]} - Y_i)

But that is not really what we need to plug into the way Prof Ng has structured all the layers of functions here.
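In code, the dAL for this loss could look like the sketch below. The function names and the sample values are made up for illustration.

```python
import numpy as np

def mse_loss(Y, AL):
    # Per-sample loss vector L = 1/2 (AL - Y)^2
    return 0.5 * (AL - Y) ** 2

def dAL_mse(Y, AL):
    # dL/dAL: no summation and no 1/m, so it keeps the shape of AL
    return AL - Y

Y = np.array([[6.0, 18.0, 36.0]])   # made-up targets
AL = np.array([[5.0, 20.0, 36.0]])  # made-up predictions
dAL = dAL_mse(Y, AL)
```

The result has one entry per sample, the same shape as AL, which is what the later backward steps expect to receive as dA.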

I am taking advantage of the feature of formatting LaTeX expressions here on Discourse. That was explained on the DLS FAQ Thread. Of course that assumes you are familiar with LaTeX, a language for formatting mathematical expressions that Leslie Lamport built on top of Prof Donald Knuth’s TeX. If that is new to you, just google “LaTeX” and you’ll find plenty of useful info.


This entire process is just a big application of the Chain Rule. Computing dAL is just one step in the process. Of course what we really need to run back propagation is the gradients of the W and b values. So you have to march backwards through the layers and apply the Chain Rule. This is the definition of the gradient of W^{[L]}:

dW^{[L]} = \displaystyle \frac {\partial J}{\partial W^{[L]}}

To compute that, we apply the Chain Rule in a bunch of steps:

\displaystyle \frac {\partial J}{\partial W^{[L]}} = \frac {\partial J}{\partial L} * \frac {\partial L}{\partial A^{[L]}} * \frac {\partial A^{[L]}}{\partial Z^{[L]}} * \frac {\partial Z^{[L]}}{\partial W^{[L]}}

If you think about the meaning of that formula a bit and then study how L_model_backward calls linear_activation_backward calls relu_backward and linear_backward, that is what all that is really doing. Of course you can see that both dAL and the derivative of the activation function are showing up there as factors, right? The derivative of the activation function is this one:

\displaystyle \frac {\partial A^{[L]}}{\partial Z^{[L]}}
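Here is a small sketch of those factors for a single output layer with ReLU and the MSE loss, checked against a numerical gradient of J. The shapes, the random data and the helper name cost are all assumptions made for the example, not the assignment's code.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5
A_prev = rng.normal(size=(3, m))   # made-up activations from the previous layer
W = rng.normal(size=(1, 3))
b = np.array([[0.5]])
Y = rng.normal(size=(1, m))

def cost(W_try):
    # J = average over the samples of the loss 1/2 (A - Y)^2, with a ReLU output layer
    Z = W_try @ A_prev + b
    A = np.maximum(0, Z)
    return np.sum(0.5 * (A - Y) ** 2) / m

# Backward pass, one Chain Rule factor at a time
Z = W @ A_prev + b
A = np.maximum(0, Z)
dA = A - Y                  # dL/dA for the MSE loss
dZ = dA * (Z > 0)           # times dA/dZ, the ReLU derivative
dW = (dZ @ A_prev.T) / m    # times dZ/dW, then averaged over samples (the step that brings in J)

# Numerical gradient of J w.r.t. each entry of W, for comparison
eps = 1e-6
dW_numeric = np.zeros_like(W)
for j in range(W.shape[1]):
    step = np.zeros_like(W)
    step[0, j] = eps
    dW_numeric[0, j] = (cost(W + step) - cost(W - step)) / (2 * eps)
```

If the factors are put together correctly, dW and dW_numeric should match closely. The 1/m only shows up in the last line, which matches the point that dAL itself carries no summation and no averaging.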

You start with randomly initialized values of all the W and b parameters. Then at each iteration, you apply the gradients dW and db to compute the new (hopefully better) values of W and b for all layers. That is the “update parameters” logic that we implemented in Week 3 and Week 4:

W = W - \alpha * dW

but applied individually at each layer …
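As a sketch of that update step: the dictionary layout with keys like "W1" and "dW1" follows the convention used in the assignments as far as I recall it, and the function name and numbers are made up.

```python
import numpy as np

def update_parameters_sketch(parameters, grads, learning_rate):
    """Apply W = W - alpha * dW and b = b - alpha * db at every layer."""
    num_layers = len(parameters) // 2  # each layer contributes one W and one b
    updated = {}
    for l in range(1, num_layers + 1):
        updated["W" + str(l)] = parameters["W" + str(l)] - learning_rate * grads["dW" + str(l)]
        updated["b" + str(l)] = parameters["b" + str(l)] - learning_rate * grads["db" + str(l)]
    return updated

parameters = {"W1": np.array([[1.0, -2.0]]), "b1": np.array([[0.5]])}
grads = {"dW1": np.array([[0.2, -0.4]]), "db1": np.array([[1.0]])}
new_parameters = update_parameters_sketch(parameters, grads, learning_rate=0.1)
```

So the gradients are never used as the new parameter values. They only say in which direction, and by how much, to nudge the current W and b.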


Now I have everything clear. Thank you so much, sir. You are amazing.

PaulMielke = GreatMentor + AffablePerson



Hello sir @paulinpaloalto I hope you are doing well.
Below is my first NN model from scratch (regression problem). It did a poor job. In training, some points fit well but others are extremely poor. It seems to have both high bias and high variance. How can I improve it? This graph shows training (not dev/test).


Hello SaifKHanengr,

If your model is showing high bias and high variance then you should certainly know the factors causing these conditions. Through web search, you can get many articles covering this topic.

Here’s a link that could be useful.

You must do it step by step by following the curve.


This is probably covered in Rashmi’s link, but it looks like there is some definite pattern in the wrong answers: they are all just 0 and they seem to alternate with the correct answers. Seems like it’s worth some analysis to see if you can see any patterns in the inputs that give bad results versus the good ones. The other thing to consider is that maybe it’s not such a great idea to use ReLU as the output layer activation. The reason you are getting zero answers must be that the predictions were negative at the linear activation level, right? Try using Leaky ReLU and see if that gives negative predictions for some values. I assume negative values would not make sense in your application. Another possibility would be swish. Or if any output value between -\infty and \infty makes sense, just eliminate the output activation altogether.
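To make the difference between those output choices concrete, here is a small sketch. The Z values are made up, and the 0.01 slope for Leaky ReLU is just a commonly used default, not something taken from your code.

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def leaky_relu(Z, slope=0.01):
    # slope is an assumed value for the negative side
    return np.where(Z > 0, Z, slope * Z)

Z = np.array([[-4.0, -0.5, 3.0, 12.0]])  # made-up linear outputs of the last layer

relu_out = relu(Z)         # negatives are clipped to exactly 0
leaky_out = leaky_relu(Z)  # negatives stay negative, scaled by the slope
linear_out = Z             # no output activation: the prediction is Z itself
```

Any prediction that comes out as exactly 0 under ReLU corresponds to a negative (or zero) value of Z, which is what the Leaky ReLU experiment would reveal.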


Also stepping back a couple of steps further, maybe a Neural Network is overkill for your application. Your data looks like it could be fit with a relatively low degree polynomial. Why not try Polynomial Regression of degree 3 or 4 and see what happens?

Mind you, an NN should work, though. I’m not saying you don’t have a real problem to solve to figure out why your first try with the NN doesn’t work very well. You should be able to get that to work, but the other point is maybe you are working too hard by starting with too complex a solution.


Thank you @Rashmi and @paulinpaloalto for your time and reply. I highly appreciate it.
My data doesn’t have any negative values. It’s dummy data created by:

    X = np.arange(0, 20, 1)
    Y = 3*X**2 + 3*X

So, no negatives.

You are right, sir. I tried Polynomial and it fits very well. But now I am practicing NN from scratch with random data to learn and explore more.

PS: I will go through Rashmi’s link in two days as I am busy with the second DLS course. Then I will post an update here if I fail to train a good model.

Ah, ok, then it makes sense that Polynomial Regression will fit that. I think I can even name the coefficients it will learn :laughing:
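For anyone following along, a quick sketch with NumPy's polyfit on that same data. Degree 2 is my choice here because the data was generated from a quadratic.

```python
import numpy as np

X = np.arange(0, 20, 1)
Y = 3 * X**2 + 3 * X

# Least-squares fit of a degree-2 polynomial; coefficients come back highest power first
coeffs = np.polyfit(X, Y, deg=2)
Y_fit = np.polyval(coeffs, X)
```

The fitted coefficients should come out very close to 3, 3 and 0, matching the formula that generated the data.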

So it is the case that negative answers are not appropriate in your problem (update: this is wrong, see next post), but evidently the linear output of the final layer of your network is producing negative answers: how else do you end up with \hat{y} values that are 0 as the output of ReLU? So now the question is why the learning is not correcting that behavior of your existing network? That’s the key question you need to investigate when you get back to this. Removing ReLU at the output layer and then just letting MSE punish those bad negative values might help. At least that would be one experiment I would try …

When I first saw your output, I thought maybe it indicated some kind of bug in your implementation of MSE or the derivatives of that and ReLU. But then why would it fit perfectly on some inputs and not on others? So there is a legitimate and interesting mystery to be solved here and I’m curious to know how it turns out.

Let us know what you learn when you have time to investigate further! Science! :nerd_face:


Actually if you solve for the minimum of the function:

y = 3x^2 + 3x

You’ll find that it is at x = -\frac {1}{2} and the y value there is -\frac{3}{4} and the zeros of the function are at x = -1 and x = 0, so negative values are possible. Not sure if that in and of itself would cause the problem you are seeing, but it does indicate that ReLU is not the right choice for your output layer activation.
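Spelling out the calculus behind those numbers: set the derivative to zero to find the minimum, plug that x back in, and factor the polynomial to get the zeros.

```latex
\frac{dy}{dx} = 6x + 3 = 0 \implies x = -\frac{1}{2}

y\left(-\frac{1}{2}\right) = 3 \cdot \frac{1}{4} + 3 \cdot \left(-\frac{1}{2}\right) = \frac{3}{4} - \frac{3}{2} = -\frac{3}{4}

y = 3x(x + 1) = 0 \implies x = 0 \text{ or } x = -1
```

The function is negative only between x = -1 and x = 0, so the integer training inputs from 0 to 19 never produce a negative label, even though the function itself can go below zero.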


Hello @saifkhanengr,

Just looking at these plots I will first relate them to high variance as Rashmi pointed out. Would you mind sharing your notebook with me in a Direct Message? I am interested in this example.