Hi Masters,

I am talking about the code below:

dA1 = np.multiply(dA1,D1)

dA1 /= keep_prob

dZ1 = np.multiply(dA1, np.int64(A1 > 0))

So, in forward propagation, I need to zero out entries of A1 with D1 (a matrix of 1s and 0s) and then divide by keep_prob. I have no problem there. Now I need to do the same thing for dA1 in backward propagation (the homework hint only says to do this for dA1, not dZ1). But the 3rd line makes me wonder:

Q1: Do I need to do the same for dZ1, yes or no? And why?

Q2: In this case, np.int64(A1 > 0) is the same as D1, yes or no?

The line `dZ1 = np.multiply(dA1, np.int64(A1 > 0))` is just the implementation of this math formula for the case that g(Z) is ReLU:

dZ^{[l]} = dA^{[l]} * g^{[l]'}(Z^{[l]})

You are multiplying by dA1 and you’ve already multiplied that by D1, so there is no need to do anything more to dZ1.

For question 2, no, that is not the same mathematically as D1: D1 is a random mask of 0s and 1s. A1 might have some zero values that were zero before you multiplied it by D1, right? Maybe a subtle point and not very likely to be different, but it’s not mathematically the same thing. And what’s the point anyway: you already have that formula and don’t need to mess with it for purposes of dropout. Remember that this code all has to still work when dropout is disabled by setting `keep_prob = 1`, right? We don’t use dropout in “prediction” mode: only during training. That is true for any form of regularization.
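To make both answers concrete, here is a minimal sketch of layer 1 with dropout. The function names `forward_layer1` and `backward_layer1` and the small test matrix are made up for illustration and are not the assignment's code; only the three backward lines are taken from the question. With `keep_prob = 1.0` the mask D1 is all ones, yet `np.int64(A1 > 0)` still has zeros wherever ReLU clipped a negative input, so the two masks are not the same thing.

```python
import numpy as np

def forward_layer1(Z1, keep_prob, rng):
    """Hypothetical helper: ReLU followed by inverted dropout."""
    A1 = np.maximum(0, Z1)                                # ReLU activation
    D1 = (rng.random(A1.shape) < keep_prob).astype(int)   # random 0/1 mask
    A1 = A1 * D1 / keep_prob                              # drop units, then rescale
    return A1, D1

def backward_layer1(dA1, A1, D1, keep_prob):
    """Hypothetical helper wrapping the three lines from the question."""
    dA1 = np.multiply(dA1, D1)                  # same mask as in the forward pass
    dA1 = dA1 / keep_prob                       # same rescaling as in the forward pass
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))    # ReLU derivative, no extra masking
    return dZ1

rng = np.random.default_rng(0)
Z1 = np.array([[-1.0, 2.0], [3.0, -4.0]])
dA1 = np.ones_like(Z1)

# Dropout disabled: every random draw is below 1.0, so D1 is all ones.
A1, D1 = forward_layer1(Z1, keep_prob=1.0, rng=rng)
relu_mask = np.int64(A1 > 0)
dZ1 = backward_layer1(dA1, A1, D1, keep_prob=1.0)

print(D1)         # all ones
print(relu_mask)  # zeros where Z1 was negative
print(dZ1)
```

Because dZ1 is computed from the already masked dA1, any unit dropped by D1 gets a zero gradient without touching dZ1 a second time.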

Hi Paul. I am getting there very soon. Thanks for your help! Allow me to ask some follow-up questions.

For Q1, I really need to confirm my understanding here. In the dZ1 equation above, the derivative of ReLU is just 1 or 0, and that is the reason for the code `np.multiply(dA1, np.int64(A1 > 0))`? In other words, `np.int64(A1 > 0)` stands for the derivative of ReLU, yes or no? If my wording is not clear, please let me know.

Put another way, if this were another activation function, `np.int64(A1 > 0)` would be wrong, yes or no?

Yes, you are correct: the expression `np.int64(A1 > 0)` is just the derivative of ReLU. If you chose to use a different activation function, then you’d need to change that code.
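As a rough illustration of what "change that code" would mean, here is a sketch comparing the ReLU case with tanh. The helpers `relu_grad` and `tanh_grad` are hypothetical names with a simplified signature (they take the activation output A directly); they are not the course's helper functions. The only assumption beyond the thread is the standard identity that the derivative of tanh(z) is 1 - tanh(z)^2.

```python
import numpy as np

def relu_grad(dA, A):
    """dZ = dA * g'(Z) for ReLU: the derivative is 1 where A > 0, else 0."""
    return dA * np.int64(A > 0)

def tanh_grad(dA, A):
    """dZ = dA * g'(Z) for tanh: the derivative is 1 - A**2, with A = tanh(Z)."""
    return dA * (1.0 - A ** 2)

Z = np.array([[-1.0, 0.5], [2.0, -0.25]])
dA = np.ones_like(Z)

dZ_relu = relu_grad(dA, np.maximum(0, Z))   # a 0/1 pattern
dZ_tanh = tanh_grad(dA, np.tanh(Z))         # smooth values strictly between 0 and 1

print(dZ_relu)
print(dZ_tanh)
```

Only the factor that multiplies dA changes; the dropout lines that mask and rescale dA would stay exactly as they are.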

All this code is “hard-wired” to keep it simple to avoid distracting us from the dropout logic, which is what this is really about. Notice that it’s also hard-coded to a fixed number of layers, so there are no for loops over the hidden layers. If you were writing fully general code here that supported more than one activation function, then we’d need all the complexity we had in the Course 1 Week 4 Step by Step Assignment, right? `L_model_backward` calls `linear_activation_backward`, which calls `linear_backward` and `sigmoid_backward` or `relu_backward` or `tanh_backward` etc. That would just confuse the issue and distract us from what is new in this assignment.

The other reason that it would be a waste of effort to go for “full generality” in this assignment is that we will very soon (Course 2 Week 3) learn how to use TensorFlow to build our networks. TF takes care of the implementation details for us and we just have to put together the components (building blocks) we need to express the features we need in our network. Meaning that you will never actually need to build the general equivalent of this code to implement dropout: it will be taken care of for you by TF “under the covers”.

Master Paul, you are just sooooo kind, I have to say this.