Hello Rashmi! Thanks for your reply.
This is not a classification problem but a regression problem. My purpose is to practice NNs on regression problems, because I work as a petroleum engineer and regression problems are what we deal with.
I know dZ = 1 is wrong, but I don't know the correct derivative. If you do, kindly let me know.
Got it, SaifKhanengr,
Alright. Great to hear that! I will leave this discussion open for the Super Mentors, as they have more experience dealing with such exercises, but I will definitely keep track of this thread.
Good luck!
I think your no_relu and no_relu_backward functions look correct to me. You're right that what you're effectively saying is that the activation function at the output layer is the "identity" function, so its derivative is 1 everywhere.
So I would have expected that to work. What type of error are you seeing with that code? Note that I would have thought that this was the easy part of converting from classification to regression. The hard part is finding all the places where the different cost/loss function needs to affect the code.
Eeeek! Sorry, I wasn't thinking hard enough when I wrote the first response. Your no_relu_backward is not correct. Remember that what we are implementing there is:
dZ = dA * g'(Z)
Meaning that we’re not just returning the derivative of the activation function. So the fact that the derivative is 1 does not mean that the return value is 1, right? It means it’s equal to dA.
Actually, while you're at it, I'd feel more comfortable if you did the assignment of A = Z in no_relu with a method that produces a separate copy. The way you implemented it, A ends up being another reference to the same global object. I can't think of a case in which the return value A is going to get modified, so it's probably no harm done. But it just introduces some risk of later unpleasant surprises. There's a reason why they did that np.array(..., copy = True) there in the original code you are copying. Please see this post for more information about how object references work.
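For concreteness, here is a minimal sketch of what the two corrected functions might look like (the cache convention and asserts just follow your snippets):

import numpy as np

def no_relu(Z):
    # Identity activation for the output layer of a regression network.
    A = Z.copy()                  # a separate copy, not another reference to Z
    assert (A.shape == Z.shape)
    cache = Z                     # cached for the backward pass
    return A, cache

def no_relu_backward(dA, cache):
    # g(Z) = Z, so g'(Z) = 1 and dZ = dA * g'(Z) = dA.
    Z = cache
    dZ = np.array(dA, copy=True)  # copy so the caller's dA is not modified later
    assert (dZ.shape == Z.shape)
    return dZ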
Thank you, sir, for your time and detailed comments. I understand and have implemented it.
I changed A = Z in no_relu to A = Z.copy(). Furthermore, I corrected the derivative in no_relu_backward (dZ = dA).
I ran the code and got poor results with two errors (see the attached image with plt.title("no_relu at last layer")). Then I tried to debug by running every function separately in a separate notebook. Instead of running all the functions under the two_layer_model or L_layer_model function, I ran them separately (one iteration) and got no error. Then I ran two_layer_model for 10 iterations and got no error; then for 15 iterations and got these two errors. I don't know how "overflow encountered in square: cost = np.sum((AL-Y)**2)/(2*m)" and "invalid value encountered in add: Z = np.dot(W,A) + b" occur only after 15 iterations.
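For reference, my regression cost is computed roughly like this (a minimal sketch; only the cost line is exactly what I ran, the function name is just for illustration, and numpy is imported as np):

def compute_cost_regression(AL, Y):
    # Mean squared error cost: J = (1/(2m)) * sum((AL - Y)^2)
    m = Y.shape[1]
    cost = np.sum((AL - Y)**2) / (2 * m)
    cost = np.squeeze(cost)  # make sure cost is a plain scalar
    return cost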
Moreover, I also tried leaky_relu. I defined it like this:
def leaky_relu(Z):
    A = np.maximum(0.01*Z, Z)
    assert (A.shape == Z.shape)
    cache = Z
    return A, cache
And its derivative like this:
def leaky_relu_backward(dA, cache):
    Z = cache
    dZ = np.array(dA, copy=True)  # just converting dz to a correct object.
    dZ = np.where(Z > 0, 1, 0.05)
    assert (dZ.shape == Z.shape)
    return dZ
I got the derivative idea from here. Is my leaky_relu_backward correct or not?
By the way, it also led to poor results.
Furthermore, I tried again with relu in the last layer and got surprising results: the more iterations, the poorer the fit. 800 iterations versus 8000 iterations, as you can see in the attached images.
Aha. I just made one change to my code and got a good fit both for relu at the last layer and for no_relu at the last layer. The change is in the derivative dA2 (two-layer NN model). Previously, it was:
dA^{[2]} = (A^{[2]}-Y)
But I changed it to:
dA^{[2]} = (1./m) *(A^{[2]}-Y)
Actually, I was checking the opt_utils_v1a file of the Optimization Methods assignment (DLS Course 2, Week 2) and found that:
dz^{[3]} = 1./m * (a^{[3]} - Y).
I don't know the background of that or how it works. But two things confuse me:
- opt_utils_v1a defines the derivative of Z (dz3), whereas I defined it with A (dA2), yet I got improved results.
- leaky_relu still does a poor job.
No, sorry, but you made the exact same mistake that you made in your original implementation of no_relu_backward. Remember that dZ is not just the derivative of the activation function, right? You've just returned g'(Z) again.
So, I need to do this: dZ = np.where(Z > 0, dA, 0.05*dA). Right?
Sure, that’s one way to do it. But doesn’t this look simpler and a more obvious translation of the formula we are trying to implement:
dZ = dA * np.where(Z > 0, 1, 0.05)
The other advantage of doing it that way is that you can eliminate the copy of dA to dZ and all the np.array code. Simpler and clearer is better, right?
Thanks a million, sir, for your unlimited guidance and time. I got a good fit.
But did you notice that the first 6 values of X give y = 0 (in both no_relu and leaky_relu)? Furthermore, adding more data (samples) leads to a poor fit in all cases (relu, leaky_relu, no_relu). I am wondering why.
Lastly, what do you say about that?
Well, notice that the points for which you get \hat{y} = 0 are almost exactly the points for which your function gives negative values, right? As I commented earlier on this thread, the minimum of your function is at x = -\frac{1}{2} and it hits zero at x = -1 and x = 0.
Also notice that the results from after you fixed leaky_relu_backward are the same as they were before, but you call them "good" after the fix. I don't see any change in the graph, but maybe the resolution on the y axis is just too low to really see the behavior in detail, especially what is happening near 0.
The weird behavior you show where the output becomes basically a step function at high iterations indicates that there is still something wrong with your code as well.
You have to really look at the code carefully and understand what is going on. You have to be careful about mixing the pure general code from C1 W4 with any of the "hard-coded" implementations from C2 W1 and W2. There they are building fixed three-layer networks and they wanted to keep the code as simple as possible, so they took some shortcuts with how they handled the factor of \frac {1}{m} that arises from the fact that the dW and db gradients are derivatives of J while everything else is a derivative of L. In the full formulas for back prop, note that Prof Ng only shows the factor of \frac {1}{m} in the formulas for dW and db, but you can sort of cheat and include it in the output layer dA value, which simplifies the code if you are doing the "hard-coded for 3 layers" approach. But if you are building fully general code based on linear_activation_backward and all that, you would be wiser not to copy from the C2 implementations.
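To make that concrete, in the general C1 W4 style the \frac {1}{m} lives inside linear_backward, roughly like this (a sketch from memory, with np being numpy as before; the course version unpacks its caches in a bit more detail):

def linear_backward(dZ, cache):
    # cache holds (A_prev, W, b) from the forward pass
    A_prev, W, b = cache
    m = A_prev.shape[1]
    dW = (1. / m) * np.dot(dZ, A_prev.T)               # derivative of J w.r.t. W
    db = (1. / m) * np.sum(dZ, axis=1, keepdims=True)  # derivative of J w.r.t. b
    dA_prev = np.dot(W.T, dZ)                          # no 1/m here: derivative of L
    return dA_prev, dW, db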
Oh, wait. I noticed another bug. Here’s your “backward” code:
So notice that you hard-code the leaky slope to 0.01 in one case and to 0.05 in the other case. If that's still the way the code looks, I would expect that to cause problems. Why not make slope a parameter, the way I showed? That way it's easier to get the code right, and then the "hard-coding" only happens where you invoke the functions. It's also better because you may want to test with different values of the slope, so making it a global variable and then passing that to both the "forward" and "backward" routines is clearly the way to go: you only have to change it in one place. Modularity is a Good Thing ™!
Of course that’s the old version of your code before you fixed the “just return the derivative” bug. Maybe you fixed both bugs at the same time?
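Either way, something along these lines would avoid the mismatch entirely (a sketch only; the parameter name and default value are arbitrary, and np is numpy as before):

LEAKY_SLOPE = 0.01  # defined once, used by both routines

def leaky_relu(Z, slope=LEAKY_SLOPE):
    A = np.maximum(slope * Z, Z)
    assert (A.shape == Z.shape)
    cache = Z
    return A, cache

def leaky_relu_backward(dA, cache, slope=LEAKY_SLOPE):
    # dZ = dA * g'(Z), with g'(Z) = 1 for Z > 0 and slope otherwise
    Z = cache
    dZ = dA * np.where(Z > 0, 1, slope)
    assert (dZ.shape == Z.shape)
    return dZ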
Hello Sir! Thanks for your detailed guidance.
I am very new to Python, so I did it in a similar way to the assignment file.
So, adding that 1/m in the general code (L-layers) is not recommended. Right?
Lastly, I would like to say that my mind is in a mishmash. I don't know how to get that code to work well, so I have decided to retake DLS Course 1 and then come back to this problem. But I am extremely indebted to you, sir. You are an exceedingly kind and delightful teacher and do not get frustrated by my nonsense questions. Truly big-hearted.
That’s fine, but it still requires that you are consistent in the “slope” values that you use, right? Did you try fixing that to be consistent and see if it makes a difference?
There are two steps, right? First, you have to understand what the math says. Second, you have to understand how the code implements what the math says. I don't know if you really need to take all of Course 1 again, but that's up to you. At the very least, start by reviewing this slide (which is about 14:20 into this lecture in Week 3):
That’s where he gives the general formulas for back propagation at least in the 2 layer case. Notice where the factor of \frac {1}{m} occurs. You could change the formulas and put it only as a factor on dZ^{[2]} and eliminate it everywhere else. That would also be mathematically correct, but that’s not the way Prof Ng wrote the formulas.
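From memory, the vectorized formulas on that slide look roughly like this (classification case with a sigmoid output, which is why dZ^{[2]} comes out so simple):

dZ^{[2]} = A^{[2]} - Y
dW^{[2]} = \frac{1}{m} dZ^{[2]} A^{[1]T}
db^{[2]} = \frac{1}{m} \sum dZ^{[2]}
dZ^{[1]} = W^{[2]T} dZ^{[2]} * g^{[1]'}(Z^{[1]})
dW^{[1]} = \frac{1}{m} dZ^{[1]} X^{T}
db^{[1]} = \frac{1}{m} \sum dZ^{[1]}

Note that \frac{1}{m} appears only in the dW and db lines; the dZ values carry no such factor.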
Now study how the code works, driven by L_model_backward at the top level. Watch how that top-level function calls linear_activation_backward, which in turn calls linear_backward and relu_backward and sigmoid_backward. Where is each part of that math computation being done? Where do the factors of \frac {1}{m} show up?
I recommend that you just stick with doing everything in that coding style, rather than using the “hard-wired only 3 layers” style that you can see in Course 2. Once you get this working, you can then use it for other problems that happen to require different numbers of layers without having to rewrite the code.
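To illustrate that structure, here is a rough skeleton of the top-level routine adapted for regression (only a sketch: the function name is made up, it assumes you have extended linear_activation_backward to accept a "no_relu" option, and the \frac {1}{m} factors are applied inside linear_backward, not here):

def L_model_backward_regression(AL, Y, caches):
    # caches[l] holds the (linear, activation) cache for layer l+1
    grads = {}
    L = len(caches)  # number of layers

    # For the per-example MSE loss (1/2)*(AL - Y)^2, the derivative w.r.t. AL
    # is simply (AL - Y); the 1/m factors belong to dW and db in linear_backward.
    dAL = AL - Y

    # Output layer uses the identity ("no_relu") activation.
    current_cache = caches[L - 1]
    grads["dA" + str(L - 1)], grads["dW" + str(L)], grads["db" + str(L)] = \
        linear_activation_backward(dAL, current_cache, activation="no_relu")

    # Hidden layers use relu (or leaky_relu).
    for l in reversed(range(L - 1)):
        current_cache = caches[l]
        dA_prev, dW, db = linear_activation_backward(
            grads["dA" + str(l + 1)], current_cache, activation="relu")
        grads["dA" + str(l)] = dA_prev
        grads["dW" + str(l + 1)] = dW
        grads["db" + str(l + 1)] = db

    return grads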
Yes. I corrected it (0.01 in both places), but the result looks the same as the old one.
The main thing that makes the result a good fit is adding the 1./m factor to dAL (or dA2 in the 2-layer case). But I will check what you mentioned in detail later.
Thanks a million, sir.
Hello Sir! I hope you are doing well.
I’ve taken course 1 of DLS again and created a general L-Layers NN model for regression problems.
This \frac{1}{m} occurs in the formulas for dW and db only; I totally understand that. However, I don't know why my NN model performs poorly with dA^{[L]} = A^{[L]}-Y and performs well with dA^{[L]} = (1./m)*(A^{[L]}-Y).
Sir! If I send you that notebook, could you look through it, check where I am making mistakes (without correcting them), and then list my mistakes here? It would be extremely helpful. My request is only to find the faults, not to correct them.
Saif.
The \frac{1}{m} factor effectively reduces the size of the gradient. Without it, it is as if you were using a very large learning rate, and a large learning rate can cause your model to diverge.
Raymond
PS: Let’s recall our vanilla gradient descent formula:
w := w - \alpha\frac{\partial{J}}{\partial{w}}
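To see why, note that dropping the \frac{1}{m} from the gradient is the same as multiplying the learning rate by m:

w := w - \alpha \left( m \cdot \frac{1}{m}\frac{\partial{J}}{\partial{w}} \right) = w - (\alpha m)\frac{\partial{J}}{\partial{w}}

So with, say, a few hundred training examples, leaving out the \frac{1}{m} behaves like a learning rate a few hundred times larger, which is consistent with the overflow errors you saw once the updates started to blow up.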