I have a question regarding the regularization programming assignment from week 1. I understand that when using dropout, we divide A1 and A2 by keep_prob. I am wondering why we have to explicitly divide dA1 and dA2 by keep_prob. Aren’t the derivatives automatically scaled by (1/keep_prob) since A1 and A2 were scaled similarly? It seems like dA1 and dA2 are getting scaled by (1/keep_prob) twice.

I am also not sure why we explicitly need to apply the D1 and D2 masks to dA1 and dA2. Shouldn’t the derivatives automatically be scaled by the masks since we have scaled A1 and A2 by the same masks?

The key point is that dA1 is not computed from A1 itself, right? You have to remember what Prof Ng's notation means: the gradients are derivatives of the cost, J, w.r.t. the variable in question. And exactly for that reason, dA1 does not depend on A1: it depends on the values further "upstream", between A1 and J. This is the Chain Rule being "acted out". Recall that:

dA^{[l-1]} = W^{[l]T} \cdot dZ^{[l]}

Notice that nothing there will zero any of the gradient elements that are about to be applied to A^{[l-1]}. So in order to get the dropout effect on back propagation, we have to apply the mask and the 1/keep_prob scaling explicitly in both directions, in the same way.
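Here is a minimal NumPy sketch of that point (shapes and seed are just illustrative, not from the assignment): the upstream gradient arrives with no trace of the mask, so the mask and rescale have to be reapplied explicitly.

```python
import numpy as np

np.random.seed(0)
keep_prob = 0.8

# Forward prop with inverted dropout on a hypothetical layer 1
A1 = np.random.rand(4, 3)                   # activations
D1 = np.random.rand(*A1.shape) < keep_prob  # dropout mask
A1 = A1 * D1 / keep_prob                    # mask and rescale

# Back prop: dA1 arrives from upstream (it would be W2.T @ dZ2)
# and carries no trace of the mask
dA1 = np.random.rand(*A1.shape)             # stand-in for W2.T @ dZ2
assert not np.all(dA1[~D1] == 0)            # dropped units' grads are NOT yet zero
dA1 = dA1 * D1 / keep_prob                  # same mask, same rescale
assert np.all(dA1[~D1] == 0)                # now they are
```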

Okay, I think I see how I was getting confused. I realize we explicitly code dA^{[l-1]} = W^{[l]T} \cdot dZ^{[l]}, so the extra factors of D^{[l-1]} and \frac{1}{keep\_prob} need to be explicitly coded in as well.
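This can be checked numerically. A sketch with made-up shapes and a linear activation, purely for illustration: applying the mask and 1/keep_prob on the backward pass makes the analytic gradient agree with a finite-difference estimate of dJ/dW1.

```python
import numpy as np

np.random.seed(1)
keep_prob = 0.8
x = np.random.randn(3, 1)
W1 = np.random.randn(4, 3)
W2 = np.random.randn(1, 4)
D1 = np.random.rand(4, 1) < keep_prob  # fixed mask for the check

def forward(W1):
    Z1 = W1 @ x
    A1 = Z1 * D1 / keep_prob           # inverted dropout (linear activation)
    return (W2 @ A1).item()            # scalar "cost" J

# Analytic gradient: chain rule, with mask + rescale mirroring forward prop
dA1 = W2.T                             # dJ/dA1, knows nothing about the mask
dZ1 = dA1 * D1 / keep_prob             # apply the mask and 1/keep_prob here
dW1 = dZ1 @ x.T

# Finite-difference check on one entry of W1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
numeric = (forward(W1p) - forward(W1m)) / (2 * eps)
assert abs(numeric - dW1[0, 0]) < 1e-6
```

Dropping the `* D1 / keep_prob` line in the backward pass makes the check fail, which is exactly the point.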

Yes, if we are doing dropout at more than one layer, then we will get more than one factor of \frac{1}{keep\_prob} from the action of the Chain Rule, but that's as it should be, since we got the same number of such factors on forward prop as well. And likewise with the applications of the masks.
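For example, with dropout at two layers (all-ones masks and identity weights assumed here, just to make the scaling visible), both passes pick up the same (1/keep\_prob)^2 factor:

```python
import numpy as np

keep_prob = 0.5
D1 = np.ones((2, 1), dtype=bool)  # all-ones masks so that only the
D2 = np.ones((2, 1), dtype=bool)  # 1/keep_prob scaling shows up

# Forward: two dropout layers contribute (1/keep_prob)**2 == 4
A0 = np.ones((2, 1))
A1 = A0 * D1 / keep_prob          # identity weights assumed between layers
A2 = A1 * D2 / keep_prob
assert np.allclose(A2, 4 * A0)

# Backward: the gradients pick up the same two factors
dA2 = np.ones((2, 1)) * D2 / keep_prob
dA1 = dA2 * D1 / keep_prob        # again identity weights assumed
assert np.allclose(dA1, 4 * np.ones((2, 1)))
```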