Derivative of ReLU in the output layer

Thank you, Raymond, for responding to me.

Yes, you are right, and I am using that 1/m with W and b. But the problem is dA^{[L]}. According to Prof. Andrew, Paul, and my own understanding, we should use dA^{[L]} = A^{[L]} - Y, but this led to a poor fit.

However, dA^{[L]} = (1./m)*(A^{[L]} - Y) led to a good fit, even though we shouldn’t use 1./m here.

Now the question is: why does dA^{[L]} = A^{[L]} - Y lead to a poor fit?
Answer: Maybe there are some other bugs in my code.

If your dA does not have 1/m, then do your dw and db have 1/m?

Yes, dw and db have 1/m.

Oh, then dA shouldn’t need 1/m.

What is the value of m?

What’s your learning rate then? Would that be just too large?

m is 1000 and learning rate is 0.0075.

Hey @saifkhanengr, now I can read your code. If you can share it with me in the next 30 mins, I can read it right away; otherwise I will read it first thing tomorrow. (I am on Hong Kong time.)

Of course, if other mentors can read it for you, then it’s fine too. :wink:

Cheers,
Raymond

Thank you, Raymond, for agreeing to read my code. The notebook has been sent.

You are welcome. @saifkhanengr :wink:

Hey @saifkhanengr,

It took me some time to test something out before I could suggest, in the last part of this reply, what to do next.

Change to your code

I removed 1/m from dA[2].

Verifying your work

I am lazy, so the way I verify your work is by comparing it with TensorFlow:

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Dense(7, activation='relu', input_shape=(1,)),
    tf.keras.layers.Dense(1, activation='relu'),
])

# initialize weights the way you do
np.random.seed(1)
parameters = initialize_parameters_deep(layers_dims)
w = [parameters['W1'].T, parameters['b1'].T[0], parameters['W2'].T, parameters['b2'].T[0], ]
model.set_weights(w)

# full-batch SGD with no momentum is the same as our vanilla gradient descent
model.compile(optimizer=tf.keras.optimizers.experimental.SGD(learning_rate=0.0075), loss='mse')

# fit
h = model.fit(X.T, Y[0, :], epochs=5000, batch_size=X.shape[1], verbose=0, shuffle=False)
print(h.history['loss'][-1])
plt.scatter(X[0, :], Y[0, :])
plt.scatter(X[0, :], model(X.T))

Results:

So your result is pretty much like the TF result, and your code looks OK! I wouldn’t expect them to be exactly the same.

Conclusion

learning_rate = 0.0075 is too large

Some further study

  1. set learning_rate = 0.0075
  2. Modify your L_layer_model to print some grads:
def L_layer_model(...):
    ...
    for i in range(0, num_iterations):
        ...
        if print_cost and i % 1000 == 0 or i == num_iterations - 1:
            # print("Cost after iteration {}: {}".format(i, np.squeeze(cost)))
            print(grads['dW1'].T)
  3. Results:
[[ 0.    3.61 -2.82 16.52  0.   -3.64  0.  ]]
[[   0.   -131.81    0.      0.      0.      0.      0.  ]]
[[0.   0.07 0.   0.   0.   0.   0.  ]]
[[0. 0. 0. 0. 0. 0. 0.]]
[[0. 0. 0. 0. 0. 0. 0.]]
[[0.00e+00 5.23e-06 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00]]
  4. Observation: dW1 becomes almost always zero after about 2000 iterations, so the network is not learning anymore.

Why does it stop learning?

  1. ReLU.
  2. The gradient descent algorithm and our learning rate (0.0075) lead us there.

Since I don’t have any code bugs to list, please try to brainstorm some explanations around the above 2 points, based on your observations, experiments, and knowledge.

However, for (1), please try replacing relu with leaky_relu while keeping learning_rate = 0.0075. For (2), first I want to say, @saifkhanengr, welcome to the world of deep neural networks, where the cost surface is no longer simple and convex; second, try modifying my TensorFlow code to use Adam as the optimizer instead, keeping learning_rate = 0.0075. You will have come across Adam in DLS Course 2 Week 2.
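
For (1), here is a minimal sketch of what a leaky_relu pair could look like in NumPy. The relu(Z) -> (A, cache) / relu_backward(dA, cache) -> dZ shape mirrors the assignment's helpers, but the names and exact signatures here are my assumption, so adapt them to your notebook. For (2), the TF change is a one-line optimizer swap, shown as a comment at the end.

import numpy as np

def leaky_relu(Z, alpha=0.01):
    # forward pass: like relu, but with a small slope alpha for Z <= 0
    A = np.where(Z > 0, Z, alpha * Z)
    cache = Z
    return A, cache

def leaky_relu_backward(dA, cache, alpha=0.01):
    # backward pass: the gradient is 1 for Z > 0 and alpha (not 0) otherwise,
    # so a unit stuck in the negative region still receives some gradient
    Z = cache
    dZ = dA * np.where(Z > 0, 1.0, alpha)
    return dZ

# for (2), in the TensorFlow snippet above, replace the SGD line with:
# model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0075), loss='mse')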

Lastly, try experimenting more, and don’t limit yourself to my suggestions. Also, I only printed grads['dW1'], but you should try the others to make sure I am not fooling you by picking the only problematic one. Trust your own observations, not mine :stuck_out_tongue_winking_eye: .

Cheers,
Raymond

PS: Personally, I think that if coding an L-layer network is worth 100 experience points, then answering for yourself why learning_rate = 0.0075 is problematic is worth 5000 experience points. :wink:

PS2: It’s sleeping hours in my timezone, so I will come back tomorrow.

Hello Raymond! First, I wanted to express my gratitude for your help and support. Extremely thankful…

Yes, I checked that. Not only dW1 but the other gradients too (db1, dW2, db2) become zero with lr = 0.0075 and relu at the last layer. Similarly, leaky_relu with lr = 0.0075 leads to zeros (except a few values), and also a poor fit.

I checked relu and leaky_relu with lr = 0.0000075 and both led to the same result (a good fit). My question is: was your selection of lr = 0.0000075 a random choice? Because I checked lr = 0.00000075 (adding one more zero) and it led to a poor fit.

I think with experience, I’ll be able to do this.

I tried this, but it led to a poor fit and made all gradients zero (except a few values) after 2000 iterations. I don’t know why leaky_relu didn’t fit. TensorFlow also led to a poor fit with leaky_relu and lr = 0.0075.

Thank you, Raymond. I wasn’t scared; hahaha.

I will try this problem with Adam (in NumPy), as I am weak at TensorFlow.

Yeah, this one is necessary. Trying and learning. . .

Wow, deep maths is involved in this.

Good night, Raymond. You already provided a detailed guide. Have sweet dreams.

No, it was because you said your value of m was 1000. So that’s exactly the effect caused by adding the extra factor of \frac {1}{m} to your dAL value: you reduce all the gradient values by a factor of \frac {1}{1000}, right? Which would be equivalent to leaving dAL alone and dividing the learning rate by 1000.

This is all just the Chain Rule, right? So dAL is a factor in every gradient value. Look carefully at all the general formulas for back prop and watch what happens with each layer: the dZ value from the previous layer is one of the factors, right?
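
Here is a minimal NumPy sketch of that point, using a toy one-layer linear model rather than your actual network: scaling dAL by 1/m scales every gradient by 1/m, which is exactly the same as dividing the learning rate by m.

import numpy as np

np.random.seed(1)
m = 1000
X = np.random.randn(3, m)              # 3 features, m examples
Y = np.random.randn(1, m)
W = np.random.randn(1, 3) * 0.01
b = np.zeros((1, 1))

A = W @ X + b                          # linear output layer, so dZ = dA

def grads(dA):
    dZ = dA
    dW = (1 / m) * dZ @ X.T
    db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)
    return dW, db

dW1, db1 = grads(A - Y)                # dAL = AL - Y (the course convention)
dW2, db2 = grads((1 / m) * (A - Y))    # dAL with the extra 1/m factor

# every gradient is exactly m times smaller, so the update lr * dW2 equals
# (lr / m) * dW1 -- i.e. the same as shrinking the learning rate by 1/m
print(np.allclose(dW2, dW1 / m), np.allclose(db2, db1 / m))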

Thanks very much for all your careful analysis, Raymond! And for bringing TF into the picture as a comparison. That’s really valuable as a way to see if the problem is bugs in the code or just hyperparameter tuning issues.

BTW I think there’s a bit more subtlety to the question of whether the TF result with lr = 0.0000075 is really equivalent to the result Saif gets with the same LR. Notice that the problem for Saif’s results was never the values with x > 0: it’s the left side of the graph where the y values are negative. BTW I claim ReLU is never going to work as the output activation here, precisely because the function has negative output values over part of the domain. Notice that on Saif’s graph, the \hat{y} values are all zero until x = -\frac{3}{4}. But you used ReLU as the output activation, so I’m not sure I believe your graph. Are you sure you used the same input data? Or maybe the resolution of the graph is too low to really see what’s happening for x < 0 in enough detail.

Exactly as @paulinpaloalto explained. So the hint came from you.

leaky_relu isn’t that bad; look at my results:

I have made these changes, have you done the same?

  1. Remove the 1/m from dAL:

    # dAL = (1./m)*(AL-Y)  # commented this
    dAL = AL - Y  # uncommented this

  2. Change "relu" to "leaky_relu". There are 4 places to change. 4 places :wink:

That’s nice!

Deep and dark.

Cheers,
Raymond

Hello Paul @paulinpaloalto ,

Yes!

Would you be talking about this graph?

Saif generates the data with this code:

import numpy as np

X = np.random.randint(0, 20, size=(1, 1000))
Y = 3*X**2 + 3*X

Since X (before normalization) is always positive, all y labels are also positive; X only appears negative after normalization.
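
For reference, assuming standard (mean/std) scaling was used — the exact normalization scheme in the notebook is my assumption — normalizing the X above would look like this:

X_norm = (X - X.mean()) / X.std()   # assumed standardization
# maps the all-positive integers 0..19 to a range of roughly -1.6 to +1.6
print(X_norm.min(), X_norm.max())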

The normalized X is in the range of -1.69 to +1.62. Below are the trained weights from my TF version of Saif’s model:

W1: [[11.42, -1.1 , -0.76, 7.33, 0.5 , -3.55, 14.46]]
b1: [ 4.38, 0.6 , 0.04, 10.97, -0.82, 0.8 , 7.38]

W2: [[12.02], [0.92], [ 0.19], [13.13], [-0.79], [ 1.64], [16.04]]
b2: [3.32]

Some neurons in layer 1 can handle positive normalized X, and some handle negative normalized X.
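
As a quick check, here is a small sketch using the rounded layer-1 weights quoted above (rounded values, so it is only illustrative) to see which relu units are active for a negative versus a positive normalized x:

import numpy as np

W1 = np.array([[11.42, -1.1, -0.76, 7.33, 0.5, -3.55, 14.46]])   # shape (1, 7)
b1 = np.array([4.38, 0.6, 0.04, 10.97, -0.82, 0.8, 7.38])

for x in (-1.5, 1.5):
    z1 = x * W1[0] + b1
    print(x, (z1 > 0).astype(int))   # 1 means that relu unit is active for this x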

Cheers,
Raymond

If you add one more zero there, then you also need to add one more zero to num_iterations. You know the reason, don’t you? :wink: Smaller step sizes → more steps needed.
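
Here is a toy illustration of that trade-off (a single convex parameter, not your network): with a 10x smaller learning rate, you need roughly 10x more iterations to get to the same place.

# minimize (w - 3)^2 with plain gradient descent
def run(lr, num_iterations, w=0.0):
    for _ in range(num_iterations):
        w -= lr * 2 * (w - 3)              # gradient of (w - 3)^2 is 2*(w - 3)
    return w

print(run(lr=0.0075, num_iterations=1000))     # essentially reaches 3
print(run(lr=0.00075, num_iterations=1000))    # 10x smaller steps: still well short of 3
print(run(lr=0.00075, num_iterations=10000))   # 10x more steps compensates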

Hello! Thanks @paulinpaloalto and @rmwkwok for your detailed guidance.
I am extremely sorry for not replying. Unexpectedly, I have had a fever (and many other things) for three days but will get back to this soon.
Once again, thanks a million.
Saif.

Hello!
My NumPy NN model matches TF in every case. Highly satisfied.

Oh, I get it: num_iterations and lr depend on each other. Thanks for that hint. And matching the results against TF is also a clever idea.

I tried different values of m (10000, 100000) with lr = 0.0000075 and num_iterations = 5000 and got good fits. It seems lr is independent of m. But how do we choose the initial value of lr? You divided the initial lr by a factor of m (1/1000), which led to a good fit. Can we do this in every case? For example, choose an initial lr = 0.001, check whether the fit is good, and if not, divide lr by a factor of m while keeping num_iterations constant. Is this intuition generally correct?

BTW, learning_rate = 0.00075 with num_iterations = 5000 also led to a good fit.

Similarly, is there any relation between the number of hidden layers (and the number of neurons) and lr or num_iterations? learning_rate = 0.00075 with one hidden layer and num_iterations = 5000 led to a good fit, but with three hidden layers it led to a poor fit. And learning_rate = 0.0000075 with three hidden layers and num_iterations = 5000 led to a good fit.

lr depends on the cost space. The cost space depends on your cost function, your data, and your model assumption (NN architecture).

For the data, the dependence should be very weak once your m becomes large enough. As for how large is large enough, I will leave that to you to find out.

For the model assumption, I have demonstrated that both (1) leaky_relu + 0.0075 and (2) relu + 0.0075/1000 are fine. The choice of leaky_relu or relu is a different model assumption, i.e. a different NN architecture.

As for the division by 1000, it is not something worth remembering. I only said that relu with the lr divided by 1000 is good. I never said that relu with the lr divided by 100 is bad, because I never tried it; I will leave that to you to try. I tried 1000 first because you said your m was 1000. And even if dividing by 100 turns out to be bad, I don’t know whether dividing by 500 is bad.

As for the initial lr, it is again something you need to try out. Starting with a value and then scaling it down, as you said, is a good idea; this is how I usually do it too. However, as you become more experienced, you will have your own set of initial lr values to start with. Remember, a good lr depends on three things, so there is no rule of thumb or single good starting value for every problem.

Therefore, my answer to you is: show me your experiments rather than asking me for a good value :wink: :wink: :wink:
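
If it helps, here is a sketch of that kind of sweep. The train function below is just a stand-in (plain 1-D least squares by gradient descent), not your L_layer_model, so swap in your own training call and keep num_iterations fixed while you vary lr:

import numpy as np

def train(learning_rate, num_iterations):
    # stand-in training loop: fit y = w*x + b by gradient descent on 1/2 * mean squared error
    rng = np.random.default_rng(0)
    x = rng.normal(size=1000)
    y = 3.0 * x + 1.0
    w, b = 0.0, 0.0
    for _ in range(num_iterations):
        err = w * x + b - y
        w -= learning_rate * np.mean(err * x)
        b -= learning_rate * np.mean(err)
    return np.mean((w * x + b - y) ** 2)       # final cost

for lr in [7.5e-3, 7.5e-4, 7.5e-5, 7.5e-6]:
    print(f"lr={lr:g}  final cost={train(lr, 5000):.6f}")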

Raymond

Thank you, Raymond, for your thorough and constant guidance.
Just one last question: what is the cost space? I have never heard this term before.

Hi @saifkhanengr,

This is an example of a cost space. It is a 2-dimensional space (w is one dimension, b is another). At each point in the space there is a value, which is the cost J(w, b).

(screenshot: an example cost surface J(w, b))
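
If you want to generate one yourself, here is a minimal sketch for a toy one-feature linear model (not your network): compute J(w, b) on a grid of (w, b) values and draw the contours.

import numpy as np
import matplotlib.pyplot as plt

# toy data from y = 2x + 1, and the cost J(w, b) = mean((w*x + b - y)^2)
x = np.linspace(-1, 1, 100)
y = 2 * x + 1

w_grid, b_grid = np.meshgrid(np.linspace(-2, 6, 200), np.linspace(-3, 5, 200))
J = np.mean((w_grid[..., None] * x + b_grid[..., None] - y) ** 2, axis=-1)

plt.contourf(w_grid, b_grid, J, levels=30)
plt.xlabel('w')
plt.ylabel('b')
plt.title('cost space J(w, b)')
plt.colorbar()
plt.show()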

Raymond
