Gradients with dropout

In backward propagation, gradients for the dropped units (the same units as in the forward step) are zeroed out. I understand this. But the other gradients are scaled up by a 1/keep_prob factor. WHY?

In the forward step, the following two relations
A = np.multiply(A, D)
A = A * 1/keep_prob
mean that the output of a layer (the A matrix) has already incorporated the 1/keep_prob factor.
How do I justify the relation dA = 1/keep_prob * dA, given that A is not the original output (without dropout) but the "masked" output after A = A * D and A = A * 1/keep_prob?

Thank you !

The gradients are just the derivatives of the forward functions, right? And the forward function is scaled up by that same factor. That is the so-called “inverted” dropout. Prof Ng explained why this is done in the lectures: You will eventually be running the network in prediction mode with the trained weights and no dropout (all forms of regularization only happen at training time). So if you don’t compensate for the dropout, the weights of the subsequent layers will be trained on lower activation outputs than they are really getting when applied in “prediction” mode. Here’s a thread which discusses this in more detail. To get the full picture, you’ll need to read all the posts on the rest of that thread from that point forward.
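
As an illustration (a minimal toy sketch, not the assignment's code, with made-up shapes and seed), you can check numerically that the 1/keep_prob rescaling keeps the activations on the same scale that the next layer will see at prediction time, when nothing is dropped:

import numpy as np

np.random.seed(0)
keep_prob = 0.8
A = np.random.rand(5, 1000)                    # some layer's activations
D = (np.random.rand(*A.shape) < keep_prob)     # dropout mask: keep a unit with probability keep_prob
A_drop = np.multiply(A, D) / keep_prob         # inverted dropout: zero out dropped units, scale up survivors
print(A.mean(), A_drop.mean())                 # the two means are close, so downstream layers train on
                                               # activations of the same magnitude they will see at
                                               # prediction time (no dropout, no scaling)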

I do understand the so-called "compensation" step; that was not the point of my topic. My question is why the same "compensation" step must also be performed during backpropagation.

In exercise 4 of week 1 they say:
"During forward propagation, you had divided A1 by keep_prob. In backpropagation, you'll therefore have to divide dA1 by keep_prob again (the calculus interpretation is that if A[1] is scaled by keep_prob, then its derivative dA[1] is also scaled by the same keep_prob)."

What I understand is this:
Let f be a function of x and k a constant; then d(k*f(x))/dx = k * d(f(x))/dx, right?
But if I overwrite f, i.e. set f(x) = k * f(x), then d(f(x))/dx can no longer be written as k * d(f(x))/dx, because I have overwritten the function f.
That's what happens in the function forward_propagation_with_dropout:
every layer's output is cached in the form A2 = A2 * 1/keep_prob (layer 2 in this case).
When computing the gradient of a layer's output, I understand that all dropped nodes must have zero gradient, so the step dA2 = np.multiply(dA2, D2) is justified, where D2 is the mask matrix. But I don't understand why dA2 should be rescaled as dA2 = 1/keep_prob * dA2.

I guess I must be missing your point. You just proved to yourself why you need that factor in back propagation: if the forward function has that factor, then the derivative will also have the factor. It’s just calculus. But remember that it’s the chain rule and this is a huge layered composition of functions. The actual “dropping” is handled by the mask matrices, but those are also part of the back prop formulas, right? Unless you’re in the “without dropout” case. But also remember that in the case you are running the model without dropout, then keep_prob = 1, right?
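
To make the "it's just calculus" point concrete, here is a throwaway gradient check (my own toy code, not the graded function) for a single masked, scaled ReLU layer; the analytical gradient only matches the numerical one if back prop applies the same D mask and the same 1/keep_prob factor:

import numpy as np

np.random.seed(1)
keep_prob = 0.8
Z = np.random.randn(3, 4)
D = (np.random.rand(*Z.shape) < keep_prob)     # fixed mask for this check

def forward(Z):
    A = np.maximum(0, Z)                       # ReLU
    return np.multiply(A, D) / keep_prob       # dropout mask + inverted-dropout scaling

# analytical gradient of sum(forward(Z)) with respect to Z
dA = np.ones_like(Z)                           # d(sum)/dA
dA = np.multiply(dA, D) / keep_prob            # the two dropout steps of back prop
dZ = dA * (Z > 0)                              # ReLU derivative

# numerical gradient by central differences
eps = 1e-6
dZ_num = np.zeros_like(Z)
for i in range(Z.shape[0]):
    for j in range(Z.shape[1]):
        Zp, Zm = Z.copy(), Z.copy()
        Zp[i, j] += eps
        Zm[i, j] -= eps
        dZ_num[i, j] = (forward(Zp).sum() - forward(Zm).sum()) / (2 * eps)

print(np.allclose(dZ, dZ_num, atol=1e-5))      # True; drop the 1/keep_prob in back prop and this fails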

The neurons are considered zero during backpropagation as well. Otherwise dropout wouldn't do anything! Remember that forward propagation during training is only used to set up the network for backpropagation, where the network is actually modified (as well as for tracking training error).

but the other gradients are scaled up by 1/keep_prob factor . WHY ?
The reason for this is:
it’s important to account for anything that you’re doing in the forward step in the backward step as well – otherwise you are computing a gradient of a different function than you’re evaluating.

In forward propagation, inputs are set to zero with probability p, and otherwise scaled up by 1/(1 − p).

In backward propagation, gradients for the same dropped units are zeroed out; the other gradients are scaled up by the same 1/(1 − p) factor.
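
In code (hypothetical helper names, written in the assignment's keep_prob convention, where keep_prob = 1 − p), the symmetry looks like this: whatever the mask and the scaling do on the way forward, they do to the gradient on the way back:

import numpy as np

def dropout_forward(A, keep_prob):
    D = (np.random.rand(*A.shape) < keep_prob)   # drop a unit with probability 1 - keep_prob
    A = np.multiply(A, D) / keep_prob            # zero out dropped units, scale up the rest
    return A, D                                  # cache D for the backward pass

def dropout_backward(dA, D, keep_prob):
    return np.multiply(dA, D) / keep_prob        # same mask, same 1/keep_prob factor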

Hope it clarified your doubt!!!

Regards
DP


I finally got it… It's the chain rule. Multiplying the layer's output by the inverse of keep_prob in the forward step, e.g. A1 = A1 * 1/keep_prob, is like adding a new step to the chain of composed functions. This is somewhat hidden because the result is saved in the same variable A1. The chain rule is like dz/dy * dy/dx…
What the code actually does is something like A_1 = A1 * 1/keep_prob, and then at backprop it computes dC/dA_1 * dA_1/dA1, where C is the cost function…
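
Spelled out with a made-up intermediate name (the assignment overwrites A1 in place, so A1_scaled does not exist in the real code), that chain-rule step looks like:

import numpy as np

keep_prob = 0.8
A1 = np.array([[2.0, 3.0]])                 # toy activations
A1_scaled = A1 * (1 / keep_prob)            # the "hidden" extra link in the chain of functions

dA1_scaled = np.array([[0.5, -1.0]])        # pretend this is dC/dA1_scaled from the layer above
dA1 = dA1_scaled * (1 / keep_prob)          # chain rule: dA1_scaled/dA1 = 1/keep_prob,
                                            # which is exactly the dA1 = dA1 / keep_prob line in back prop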

The sentence “it’s important to account for anything that you’re doing in the forward step in the backward step as well – otherwise you are computing a gradient of a different function than you’re evaluating.” enlightened me :smile:

Thank you guys !