During the Week 1 lectures where Professor Andrew introduces the concept of regularization, he discusses Dropout as one such technique. Now, while implementing dropout at any layer l, we first compute a dropout matrix for that layer and multiply it with A[l] (the activations of that layer) to zero out the effect of certain neurons. Once we are through with this, Professor Andrew also mentions the significance of rescaling the A[l] matrix by performing an element-wise division by the keep_prob variable. I would really appreciate it if someone could elaborate on the rescaling procedure and explain how dividing by keep_prob helps.
Thank you.
Hi, @jaylodha.
By dropping some units during training you are changing the expected values of the activations with respect to test time. To compensate for this, you can either scale down the weights at test time (multiplying by keep_probs), or scale up the activations at training time (dividing by keep_probs).
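For concreteness, here is a minimal numpy sketch of the second option, the "inverted dropout" variant where the rescaling happens at training time. The function name and the `rng` argument are just illustrative, not the assignment's exact code:

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Inverted dropout on the activations A of one layer."""
    # Bernoulli mask: each unit is kept with probability keep_prob
    D = (rng.random(A.shape) < keep_prob).astype(A.dtype)
    A_dropped = A * D           # zero out the dropped units
    A_dropped /= keep_prob      # scale up so the expected value is unchanged
    return A_dropped, D         # the mask D is reused in backprop

# toy usage
rng = np.random.default_rng(0)
A1 = rng.random((4, 3))
A1_dropped, D1 = dropout_forward(A1, keep_prob=0.8, rng=rng)
```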
You have all the details here.
Happy learning! :slight_smile:
To be more specific, at training time you’re multiplying the activations with a vector of independent Bernoulli random variables whose expected value is precisely keep_probs, so you divide by keep_probs to compensate for this.
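A quick way to see this numerically (a toy check, not part of the assignment): the mask's mean is roughly keep_prob, so masking shrinks the mean activation by that factor, and dividing by keep_prob brings it back:

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.8
A = rng.random((5, 1000))                 # some activations

D = rng.random(A.shape) < keep_prob       # Bernoulli mask, E[D] = keep_prob
print(D.mean())                           # ~0.8
print(A.mean())                           # original mean
print((A * D).mean())                     # shrunk by roughly keep_prob
print((A * D / keep_prob).mean())         # back to roughly the original mean
```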
Let me know if that helped.
Hey, thanks a lot for the explanation. It was highly insightful.
Hi!
Scaling the activation vector makes sense. But I have a question about also scaling dA during backprop as mentioned in the lab:
" 1. During forward propagation, you had divided A1
by keep_prob
. In backpropagation, you’ll therefore have to divide dA1
by keep_prob
again (the calculus interpretation is that if 𝐴[1] is scaled by keep_prob
, then its derivative 𝑑𝐴[1] is also scaled by the same keep_prob
)."
Why do we also have to scale dA? I really don’t know where to even start thinking about this. I would have assumed that the scaling would carry over in the computation graph when the gradients are calculated, and hence it would not be necessary to apply the scaling again.
Thank you !
All backprop does is take the derivatives of the functions in forward prop and apply them. If the forward function has a factor of 1/keep_prob, then its derivative also has the same factor, right? If we have a function:
g(z) = a * f(z)
where a is a constant, then we have:
g'(z) = a * f'(z)
That’s just basic calculus and of course it applies even in the multivariate and vector case.
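In code, this amounts to reusing the same mask and the same division in the backward step. A sketch matching the hypothetical `dropout_forward` above, where `dA` is the gradient flowing back into that layer:

```python
def dropout_backward(dA, D, keep_prob):
    """Backward pass of inverted dropout.

    The forward pass computed A * D / keep_prob, so by the chain rule
    the gradient picks up the same factor: dA * D / keep_prob.
    """
    return dA * D / keep_prob
```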