Backpropagation with Dropout

I have a question regarding the regularization programming assignment from week 1. I understand that when using dropout, we divide A1 and A2 by keep_prob. I am wondering why we have to explicitly divide dA1 and dA2 by keep_prob. Aren’t the derivatives automatically scaled by (1/keep_prob) since A1 and A2 were scaled similarly? It seems like dA1 and dA2 are getting scaled by (1/keep_prob) twice.

I am also not sure why we explicitly need to apply the D1 and D2 masks to dA1 and dA2. Shouldn’t the derivatives automatically be scaled by the masks since we have scaled A1 and A2 by the same masks?

Suppose you have a / k and you need the partial derivative of that term with respect to a.
Since k is a constant, the result is (1/k) * da.
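
Written out as a tiny worked step, in the same differential notation used in this thread:

\displaystyle \frac{\partial}{\partial a}\left(\frac{a}{k}\right) = \frac{1}{k}, \quad \text{so} \quad d\!\left(\frac{a}{k}\right) = \frac{1}{k}\, da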

Does this help?

The key point is that the derivative there is not a derivative of A1 or A2, right? You have to remember what Prof Ng’s notation means: the gradients are derivatives of the cost, J, w.r.t. the variable in question. And exactly for that reason note that dA1 does not depend on A1: it depends on the values further “upstream” between A1 and J. This is the Chain Rule being “acted out”. Recall that:

dA^{[l-1]} = \displaystyle \frac {\partial J}{\partial A^{[l-1]}} = W^{[l]T} \cdot dZ^{[l]}

Notice that nothing there will zero any of the gradient elements that are about to be applied to A^{[l-1]}. So in order to get the dropout effect on back propagation, we have to apply it in both directions in the same way.
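
For concreteness, here is a minimal NumPy sketch of that point, using the same names (A1, D1, dA1, keep_prob) as the assignment; the shapes and random values are just illustrative assumptions, not the assignment's actual code:

```python
import numpy as np

np.random.seed(1)
keep_prob = 0.8

# --- Forward prop with (inverted) dropout on layer 1; shapes are illustrative ---
A1 = np.random.randn(4, 3)                    # activations of layer 1
D1 = np.random.rand(*A1.shape) < keep_prob    # dropout mask for layer 1
A1 = A1 * D1                                  # shut down the dropped neurons
A1 = A1 / keep_prob                           # rescale the survivors

# --- Back prop: dA1 = W2^T . dZ2 has no memory of the mask or the rescaling ---
W2 = np.random.randn(5, 4)
dZ2 = np.random.randn(5, 3)
dA1 = W2.T @ dZ2       # nothing here zeros any entries or divides by keep_prob
dA1 = dA1 * D1         # so we re-apply the same mask explicitly ...
dA1 = dA1 / keep_prob  # ... and the same rescaling
```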

Thank you so much for your explanation! However, I am still a little confused. Attached is a pic of what is confusing to me.

Okay, I think I see how I was getting confused. I realize we have explicitly coded dA^{[l-1]} = W^{[l]T} \cdot dZ^{[l]}, so the extra factors of D^{[l-1]} and 1/keep_prob need to be explicitly coded in.

Yes, if we are doing dropout at more than one layer, then we will get more than one factor of \frac{1}{\text{keep\_prob}} from the action of the Chain Rule, but that’s as it should be, since we got the same number of such factors on forward prop as well. And likewise with the applications of the masks.
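
Just to make the counting concrete, here is a small hypothetical sketch with dropout masks at two layers, pretending g'(Z^{[2]}) = 1 so the factors are easy to see (the shapes and names beyond D1, D2, keep_prob are illustrative assumptions):

```python
import numpy as np

np.random.seed(2)
keep_prob = 0.8

# Hypothetical shapes: layer 1 has 4 units, layer 2 has 5, batch size 3.
W2  = np.random.randn(5, 4)
D1  = np.random.rand(4, 3) < keep_prob    # mask applied to A1 in forward prop
D2  = np.random.rand(5, 3) < keep_prob    # mask applied to A2 in forward prop
dA2 = np.random.randn(5, 3)

dA2 = (dA2 * D2) / keep_prob              # first mask and first 1/keep_prob factor
dZ2 = dA2                                 # treating g'(Z2) as 1 just to keep it short
dA1 = ((W2.T @ dZ2) * D1) / keep_prob     # second mask and second 1/keep_prob factor
# Forward prop applied the same two masks and two rescalings, so the counts match.
```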