Hello Rashmi! Thanks for your reply.
This is not a classification problem but a regression problem. My purpose is to practice NNs on regression problems, because I work as a petroleum engineer and regression problems are what we deal with.
I know dZ = 1 is wrong, but I don't know the correct derivative. If you do, kindly let me know.
Got it, SaifKhanengr,
Alright. Great to hear that! I will leave this discussion open for the Super Mentors, as they have more experience dealing with such exercises, but I will definitely keep track of this thread.
Good luck!
I think your no_relu and no_relu_backward functions look correct to me. You're right that what you're effectively saying is that the activation function at the output layer is the "identity" function, so its derivative is 1 everywhere.
So I would have expected that to work. What type of error are you seeing with that code? Note that I would have thought that this was the easy part of converting from classification to regression. The hard part is finding all the places where the different cost/loss function needs to affect the code.
Eeeek! Sorry, I wasn't thinking hard enough when I wrote the first response. Your no_relu_backward is not correct. Remember that what we are implementing there is:
dZ = dA * g'(Z)
Meaning that we’re not just returning the derivative of the activation function. So the fact that the derivative is 1 does not mean that the return value is 1, right? It means it’s equal to dA.
Actually, while you're at it, I'd feel more comfortable if you did the assignment of A = Z in no_relu with a method that produces a separate copy. The way you implemented it, A ends up being another reference to the same global object. I can't think of a case in which the return value A is going to get modified, so it's probably no harm done. But it just introduces some risk of later unpleasant surprises. There's a reason why they did that np.array(..., copy = True) there in the original code you are copying. Please see this post for more information about how object references work.
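For concreteness, here is a minimal sketch of what the two corrected functions might look like (the cache convention and asserts just follow your snippets):

import numpy as np

def no_relu(Z):
    # Identity activation for the output layer of a regression network.
    A = Z.copy()                  # a separate copy, not another reference to Z
    assert (A.shape == Z.shape)
    cache = Z                     # cached for the backward pass
    return A, cache

def no_relu_backward(dA, cache):
    # g(Z) = Z, so g'(Z) = 1 and dZ = dA * g'(Z) = dA.
    Z = cache
    dZ = np.array(dA, copy=True)  # copy so the caller's dA is not modified later
    assert (dZ.shape == Z.shape)
    return dZ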
Thank you, sir, for your time and detailed comments. I understand and have implemented it.
I changed A = Z in no_relu to A = Z.copy(). Furthermore, I corrected the derivative in no_relu_backward (dZ = dA).
I ran the code and got poor results with two errors (see the attached image with plt.title("no_relu at last layer")). Then I tried to debug by running every function separately in a separate notebook. Instead of running all the functions under the two_layer_model or L_layer_model function, I ran them separately (one iteration) and got no error. Then I ran two_layer_model for 10 iterations and got no error; then for 15 iterations and got these two errors. I don't know how "overflow encountered in square: cost = np.sum((AL-Y)**2)/(2*m)" and "invalid value encountered in add: Z = np.dot(W,A) + b" occur only after 15 iterations.
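For reference, my regression cost is computed roughly like this (a minimal sketch; only the cost line is exactly what I ran, the function name is just for illustration, and numpy is imported as np):

def compute_cost_regression(AL, Y):
    # Mean squared error cost: J = (1/(2m)) * sum((AL - Y)^2)
    m = Y.shape[1]
    cost = np.sum((AL - Y)**2) / (2 * m)
    cost = np.squeeze(cost)  # make sure cost is a plain scalar
    return cost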
Moreover, I also tried leaky_relu. I defined it like this:
def leaky_relu(Z):
    A = np.maximum(0.01*Z, Z)
    assert (A.shape == Z.shape)
    cache = Z
    return A, cache
And its derivative like this:
def leaky_relu_backward(dA, cache):
    Z = cache
    dZ = np.array(dA, copy=True)  # just converting dz to a correct object.
    dZ = np.where(Z > 0, 1, 0.05)
    assert (dZ.shape == Z.shape)
    return dZ
I got the derivative idea from here. Is my leaky_relu_backward correct or not?
By the way, it also led to poor results.
Furthermore, I tried again with relu in the last layer and got surprising results: the more iterations, the poorer the fit. 800 iterations versus 8000 iterations, as you can see in the attached images.
Aha. I just made one change to my code and got a good fit both for relu at the last layer and for no_relu at the last layer. The change is in the derivative dA2 (two-layer NN model). Previously, it was:
dA^{[2]} = (A^{[2]}-Y)
But I changed it to:
dA^{[2]} = (1./m) *(A^{[2]}-Y)
Actually, I was checking the opt_utils_v1a file of the Optimization Methods assignment (DLS Course 2, Week 2) and found that:
dz^{[3]} = 1./m * (a^{[3]} - Y).
I don't know the background of that or how it works. But two things confuse me:
- opt_utils_v1a defines the derivative of Z (dz3), whereas I defined it with A (dA2), yet I got improved results.
- leaky_relu still does a poor job.
No, sorry, but you made the exact same mistake that you made in your original implementation of no_relu_backward. Remember that dZ is not just the derivative of the activation function, right? You've just returned g'(Z) again.
So, I need to do this: dZ = np.where(Z > 0, dA, 0.05*dA). Right?
Sure, that’s one way to do it. But doesn’t this look simpler and a more obvious translation of the formula we are trying to implement:
dZ = dA * np.where(Z > 0, 1, 0.05)
The other advantage of doing it that way is that you can eliminate the copy of dA to dZ and all the np.array code. Simpler and clearer is better, right?
Thanks a million, sir, for your unlimited guidance and time. I got a good fit.
But did you notice that the first 6 values of X give y = 0 (in both no_relu and leaky_relu)? Furthermore, adding more data (samples) leads to a poor fit in all cases (relu, leaky_relu, no_relu). I am wondering why.
Lastly, what do you say about that?
Well, notice that the points for which you get \hat{y} = 0 are almost exactly the points for which your function gives negative values, right? As I commented earlier on this thread, the minimum of your function is at x = -\frac{1}{2} and it hits zero at x = -1 and x = 0.
Also notice that the results from after you fixed leaky_relu_backward are the same as they were before, but you call them "good" after the fix. I don't see any change in the graph, but maybe the resolution on the y axis is just too low to really see the behavior in detail, especially what is happening near 0.
The weird behavior you show where the output becomes basically a step function at high iterations indicates that there is still something wrong with your code as well.
You have to really look at the code carefully and understand what is going on. You have to be careful about mixing the pure general code from C1 W4 with any of the "hard-coded" implementations from C2 W1 and W2. There they are building fixed three-layer networks and they wanted to keep the code as simple as possible, so they took some shortcuts with how they handled the factor of \frac {1}{m} that arises from the fact that the dW and db gradients are derivatives of J while everything else is a derivative of L. In the full formulas for back prop, note that Prof Ng only shows the factor of \frac {1}{m} in the formulas for dW and db, but you can sort of cheat and include it in the output layer dA value, which simplifies the code if you are doing the "hard-coded for 3 layers" approach. But if you are building fully general code based on linear_activation_backward and all that, you would be wiser not to copy from the C2 implementations.
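To make that concrete, in the general C1 W4 style the \frac {1}{m} lives inside linear_backward, roughly like this (a sketch from memory, with np being numpy as before; the course version unpacks its caches in a bit more detail):

def linear_backward(dZ, cache):
    # cache holds (A_prev, W, b) from the forward pass
    A_prev, W, b = cache
    m = A_prev.shape[1]
    dW = (1. / m) * np.dot(dZ, A_prev.T)               # derivative of J w.r.t. W
    db = (1. / m) * np.sum(dZ, axis=1, keepdims=True)  # derivative of J w.r.t. b
    dA_prev = np.dot(W.T, dZ)                          # no 1/m here: derivative of L
    return dA_prev, dW, db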
Oh, wait. I noticed another bug. Here’s your “backward” code:
So notice that you hard-code the leaky slope to 0.01 in one case and to 0.05 in the other case. If that's still the way the code looks, I would expect that to cause problems. Why not make slope a parameter, the way I showed? That way it's easier to get the code right, and then the "hard-coding" only happens where you invoke the functions. It's also better because you may want to test with different values of the slope, so making it a global variable and then passing that to both the "forward" and "backward" routines is clearly the way to go: you only have to change it in one place. Modularity is a Good Thing ™!
Of course that’s the old version of your code before you fixed the “just return the derivative” bug. Maybe you fixed both bugs at the same time?
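Either way, something along these lines would avoid the mismatch entirely (a sketch only; the parameter name and default value are arbitrary, and np is numpy as before):

LEAKY_SLOPE = 0.01  # defined once, used by both routines

def leaky_relu(Z, slope=LEAKY_SLOPE):
    A = np.maximum(slope * Z, Z)
    assert (A.shape == Z.shape)
    cache = Z
    return A, cache

def leaky_relu_backward(dA, cache, slope=LEAKY_SLOPE):
    # dZ = dA * g'(Z), with g'(Z) = 1 for Z > 0 and slope otherwise
    Z = cache
    dZ = dA * np.where(Z > 0, 1, slope)
    assert (dZ.shape == Z.shape)
    return dZ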
Hello Sir! Thanks for your detailed guidance.
I am very new to Python, so I did it in a similar way to the assignment file.
So, adding that 1/m in the general code (L-layers) is not recommended. Right?
Lastly, I would like to say that my mind is in a mishmash. I don't know how to get that code to work well, so I have decided to retake DLS Course 1 and then come back to this problem. But I am extremely indebted to you, sir. You are an exceedingly kind and delightful teacher and do not get frustrated by my nonsense questions. Truly big-hearted.
That’s fine, but it still requires that you are consistent in the “slope” values that you use, right? Did you try fixing that to be consistent and see if it makes a difference?
There are two steps, right? First, you have to understand what the math says. Second, you have to understand how the code implements what the math says. I don't know if you really need to take all of Course 1 again, but that's up to you. At the very least, start by reviewing this slide (which is about 14:20 into this lecture in Week 3):
That’s where he gives the general formulas for back propagation at least in the 2 layer case. Notice where the factor of \frac {1}{m} occurs. You could change the formulas and put it only as a factor on dZ^{[2]} and eliminate it everywhere else. That would also be mathematically correct, but that’s not the way Prof Ng wrote the formulas.
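From memory, the vectorized formulas on that slide look roughly like this (classification case with a sigmoid output, which is why dZ^{[2]} comes out so simple):

dZ^{[2]} = A^{[2]} - Y
dW^{[2]} = \frac{1}{m} dZ^{[2]} A^{[1]T}
db^{[2]} = \frac{1}{m} \sum dZ^{[2]}
dZ^{[1]} = W^{[2]T} dZ^{[2]} * g^{[1]'}(Z^{[1]})
dW^{[1]} = \frac{1}{m} dZ^{[1]} X^{T}
db^{[1]} = \frac{1}{m} \sum dZ^{[1]}

Note that \frac{1}{m} appears only in the dW and db lines; the dZ values carry no such factor.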
Now study how the code works, driven by L_model_backward at the top level. Watch how that top-level function calls linear_activation_backward, which in turn calls linear_backward and relu_backward and sigmoid_backward. Where is each part of that math computation being done? Where do the factors of \frac {1}{m} show up?
I recommend that you just stick with doing everything in that coding style, rather than using the “hard-wired only 3 layers” style that you can see in Course 2. Once you get this working, you can then use it for other problems that happen to require different numbers of layers without having to rewrite the code.
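To illustrate that structure, here is a rough skeleton of the top-level routine adapted for regression (only a sketch: the function name is made up, it assumes you have extended linear_activation_backward to accept a "no_relu" option, and the \frac {1}{m} factors are applied inside linear_backward, not here):

def L_model_backward_regression(AL, Y, caches):
    # caches[l] holds the (linear, activation) cache for layer l+1
    grads = {}
    L = len(caches)  # number of layers

    # For the per-example MSE loss (1/2)*(AL - Y)^2, the derivative w.r.t. AL
    # is simply (AL - Y); the 1/m factors belong to dW and db in linear_backward.
    dAL = AL - Y

    # Output layer uses the identity ("no_relu") activation.
    current_cache = caches[L - 1]
    grads["dA" + str(L - 1)], grads["dW" + str(L)], grads["db" + str(L)] = \
        linear_activation_backward(dAL, current_cache, activation="no_relu")

    # Hidden layers use relu (or leaky_relu).
    for l in reversed(range(L - 1)):
        current_cache = caches[l]
        dA_prev, dW, db = linear_activation_backward(
            grads["dA" + str(l + 1)], current_cache, activation="relu")
        grads["dA" + str(l)] = dA_prev
        grads["dW" + str(l + 1)] = dW
        grads["db" + str(l + 1)] = db

    return grads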
Yes. I corrected it (0.01 in both places), but the result looks the same as the old one.
The main thing that makes the result a good fit is adding the 1./m factor to dAL (or dA2 in the 2-layer case). But I will check what you mentioned in detail later.
Thanks a million, sir.
Hello Sir! I hope you are doing well.
I’ve taken course 1 of DLS again and created a general L-Layers NN model for regression problems.
This \frac{1}{m} occurs in the formulas for dW and db only; I totally understand that. However, I don't know why my NN model performs poorly with dA^{[L]} = A^{[L]}-Y and performs well with dA^{[L]} = (1./m)*(A^{[L]}-Y).
Sir! If I send you that notebook, could you look through it, check where I am making mistakes (without correcting them), and then list my mistakes here? It would be extremely helpful. My request is only to find the faults, not to correct them.
Saif.
The \frac{1}{m} factor effectively reduces the size of the gradient. Without it, it is as if you were using a very large learning rate, and a large learning rate can cause your model to diverge.
Raymond
PS: Let’s recall our vanilla gradient descent formula:
w := w - \alpha\frac{\partial{J}}{\partial{w}}
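To see why, note that dropping the \frac{1}{m} from the gradient is the same as multiplying the learning rate by m:

w := w - \alpha \left( m \cdot \frac{1}{m}\frac{\partial{J}}{\partial{w}} \right) = w - (\alpha m)\frac{\partial{J}}{\partial{w}}

So with, say, a few hundred training examples, leaving out the \frac{1}{m} behaves like a learning rate a few hundred times larger, which is consistent with the overflow errors you saw once the updates started to blow up.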