Gradients with dropout

In backward propagation, gradients for the dropped units (the same units as in the forward step) are zeroed out. I understand this. But the other gradients are scaled up by a 1/keep_prob factor. WHY?

In the forward step, the following two relations
A = np.multiply(A, D)
A = A * 1/keep_prob
mean that the output of a layer (the A matrix) has already incorporated the 1/keep_prob factor.
How do I justify the relation dA = 1/keep_prob * dA, given that A is not the original output (without dropout) but the "masked" output after A = A * D and A = A * 1/keep_prob?

Thank you !

The gradients are just the derivatives of the forward functions, right? And the forward function is scaled up by that same factor. That is the so-called “inverted” dropout. Prof Ng explained why this is done in the lectures: You will eventually be running the network in prediction mode with the trained weights and no dropout (all forms of regularization only happen at training time). So if you don’t compensate for the dropout, the weights of the subsequent layers will be trained on lower activation outputs than they are really getting when applied in “prediction” mode. Here’s a thread which discusses this in more detail. To get the full picture, you’ll need to read all the posts on the rest of that thread from that point forward.
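
As an illustration (a minimal toy sketch, not the assignment's code, with made-up shapes and seed), you can check numerically that the 1/keep_prob rescaling keeps the activations on the same scale that the next layer will see at prediction time, when nothing is dropped:

import numpy as np

np.random.seed(0)
keep_prob = 0.8
A = np.random.rand(5, 1000)                    # some layer's activations
D = (np.random.rand(*A.shape) < keep_prob)     # dropout mask: keep a unit with probability keep_prob
A_drop = np.multiply(A, D) / keep_prob         # inverted dropout: zero out dropped units, scale up survivors
print(A.mean(), A_drop.mean())                 # the two means are close, so downstream layers train on
                                               # activations of the same magnitude they will see at
                                               # prediction time (no dropout, no scaling)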

I do understand the so-called "compensation" step; that was not the point of my topic. My question is why the same "compensation" step must also be performed during backpropagation.

In exercise 4 of week 1 they say:
"During forward propagation, you had divided A1 by keep_prob. In backpropagation, you'll therefore have to divide dA1 by keep_prob again (the calculus interpretation is that if A[1] is scaled by keep_prob, then its derivative dA[1] is also scaled by the same keep_prob)."

What I understand is this:
Let f be a function of x and k a constant; then d(k*f(x))/dx = k * d(f(x))/dx, right?
But if I overwrite f, i.e. set f(x) = k * f(x), then d(f(x))/dx can no longer be written as k * d(f(x))/dx, because I have overwritten the function f.
That's what happens in the function forward_propagation_with_dropout:
every layer's output is cached in the form A2 = A2 * 1/keep_prob (layer 2 in this case).
When computing the gradient of a layer's output, I understand that all dropped nodes must have zero gradient, so the step dA2 = np.multiply(dA2, D2) is justified, where D2 is the mask matrix. But I don't understand why dA2 should be rescaled as dA2 = 1/keep_prob * dA2.

I guess I must be missing your point. You just proved to yourself why you need that factor in back propagation: if the forward function has that factor, then the derivative will also have the factor. It’s just calculus. But remember that it’s the chain rule and this is a huge layered composition of functions. The actual “dropping” is handled by the mask matrices, but those are also part of the back prop formulas, right? Unless you’re in the “without dropout” case. But also remember that in the case you are running the model without dropout, then keep_prob = 1, right?
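
To make the "it's just calculus" point concrete, here is a throwaway gradient check (my own toy code, not the graded function) for a single masked, scaled ReLU layer; the analytical gradient only matches the numerical one if back prop applies the same D mask and the same 1/keep_prob factor:

import numpy as np

np.random.seed(1)
keep_prob = 0.8
Z = np.random.randn(3, 4)
D = (np.random.rand(*Z.shape) < keep_prob)     # fixed mask for this check

def forward(Z):
    A = np.maximum(0, Z)                       # ReLU
    return np.multiply(A, D) / keep_prob       # dropout mask + inverted-dropout scaling

# analytical gradient of sum(forward(Z)) with respect to Z
dA = np.ones_like(Z)                           # d(sum)/dA
dA = np.multiply(dA, D) / keep_prob            # the two dropout steps of back prop
dZ = dA * (Z > 0)                              # ReLU derivative

# numerical gradient by central differences
eps = 1e-6
dZ_num = np.zeros_like(Z)
for i in range(Z.shape[0]):
    for j in range(Z.shape[1]):
        Zp, Zm = Z.copy(), Z.copy()
        Zp[i, j] += eps
        Zm[i, j] -= eps
        dZ_num[i, j] = (forward(Zp).sum() - forward(Zm).sum()) / (2 * eps)

print(np.allclose(dZ, dZ_num, atol=1e-5))      # True; drop the 1/keep_prob in back prop and this fails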

The neurons are considered zero during backpropagation as well. Otherwise dropout wouldn't do anything! Remember that forward propagation during training is only used to set up the network for backpropagation, where the network is actually modified (as well as for tracking training error).

but the other gradients are scaled up by 1/keep_prob factor . WHY ?
The reason for this is:
it’s important to account for anything that you’re doing in the forward step in the backward step as well – otherwise you are computing a gradient of a different function than you’re evaluating.

In forward propagation, inputs are set to zero with probability p, and otherwise scaled up by 1/(1 − p).

In backward propagation, gradients for the same dropped units are zeroed out; the other gradients are scaled up by the same 1/(1 − p) factor.
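
In code (hypothetical helper names, written in the assignment's keep_prob convention, where keep_prob = 1 − p), the symmetry looks like this: whatever the mask and the scaling do on the way forward, they do to the gradient on the way back:

import numpy as np

def dropout_forward(A, keep_prob):
    D = (np.random.rand(*A.shape) < keep_prob)   # drop a unit with probability 1 - keep_prob
    A = np.multiply(A, D) / keep_prob            # zero out dropped units, scale up the rest
    return A, D                                  # cache D for the backward pass

def dropout_backward(dA, D, keep_prob):
    return np.multiply(dA, D) / keep_prob        # same mask, same 1/keep_prob factor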

Hope it clarified your doubt!!!

Regards
DP


I finally got it… It's the chain rule. Multiplying the layer's output by the inverse of keep_prob in the forward step, e.g. A1 = A1 * 1/keep_prob, is like adding a new step to the chain of composed functions. This is somewhat hidden because the result is saved in the same variable A1. The chain rule is like dz/dy * dy/dx…
What the code actually does is something like A_1 = A1 * 1/keep_prob, and then at backprop it computes dC/dA_1 * dA_1/dA1, where C is the cost function…
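
Spelled out with a made-up intermediate name (the assignment overwrites A1 in place, so A1_scaled does not exist in the real code), that chain-rule step looks like:

import numpy as np

keep_prob = 0.8
A1 = np.array([[2.0, 3.0]])                 # toy activations
A1_scaled = A1 * (1 / keep_prob)            # the "hidden" extra link in the chain of functions

dA1_scaled = np.array([[0.5, -1.0]])        # pretend this is dC/dA1_scaled from the layer above
dA1 = dA1_scaled * (1 / keep_prob)          # chain rule: dA1_scaled/dA1 = 1/keep_prob,
                                            # which is exactly the dA1 = dA1 / keep_prob line in back prop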

The sentence “it’s important to account for anything that you’re doing in the forward step in the backward step as well – otherwise you are computing a gradient of a different function than you’re evaluating.” enlightened me :smile:

Thank you guys !