During the Week 1 lectures where Professor Andrew introduces the concept of regularization, he discusses Dropout as one such technique. Now, while implementing dropout at any layer l, we first compute a dropout matrix for that layer and multiply it with A[l] (the activations of that layer) to zero out the effect of certain neurons. Once we are through with this, Professor Andrew also mentions the significance of rescaling the A[l] matrix by performing an element-wise division by the keep_prob variable. I would really appreciate it if someone could elaborate on the rescaling procedure and explain how dividing by keep_prob helps.
Thank you.
Hi, @jaylodha.
By dropping some units during training you are changing the expected values of the activations with respect to test time. To compensate for this, you can either scale down the weights at test time (multiplying by keep_probs), or scale up the activations at training time (dividing by keep_probs).
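For concreteness, here is a minimal numpy sketch of the second option, the "inverted dropout" variant where the rescaling happens at training time. The function name and the `rng` argument are just illustrative, not the assignment's exact code:

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Inverted dropout on the activations A of one layer."""
    # Bernoulli mask: each unit is kept with probability keep_prob
    D = (rng.random(A.shape) < keep_prob).astype(A.dtype)
    A_dropped = A * D           # zero out the dropped units
    A_dropped /= keep_prob      # scale up so the expected value is unchanged
    return A_dropped, D         # the mask D is reused in backprop

# toy usage
rng = np.random.default_rng(0)
A1 = rng.random((4, 3))
A1_dropped, D1 = dropout_forward(A1, keep_prob=0.8, rng=rng)
```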
You have all the details here.
Happy learning! :slight_smile:
To be more specific, at training time you’re multiplying the activations with a vector of independent Bernoulli random variables whose expected value is precisely keep_probs, so you divide by keep_probs to compensate for this.
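A quick way to see this numerically (a toy check, not part of the assignment): the mask's mean is roughly keep_prob, so masking shrinks the mean activation by that factor, and dividing by keep_prob brings it back:

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.8
A = rng.random((5, 1000))                 # some activations

D = rng.random(A.shape) < keep_prob       # Bernoulli mask, E[D] = keep_prob
print(D.mean())                           # ~0.8
print(A.mean())                           # original mean
print((A * D).mean())                     # shrunk by roughly keep_prob
print((A * D / keep_prob).mean())         # back to roughly the original mean
```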
Let me know if that helped.
Hey, thanks a lot for the explanation. It was highly insightful.
Hi!
Scaling the activation vector makes sense. But I have a question about also scaling dA during backprop as mentioned in the lab:
" 1. During forward propagation, you had divided A1
by keep_prob
. In backpropagation, you’ll therefore have to divide dA1
by keep_prob
again (the calculus interpretation is that if 𝐴[1] is scaled by keep_prob
, then its derivative 𝑑𝐴[1] is also scaled by the same keep_prob
)."
Why do we also have to scale dA? I really don’t know where to even start thinking about this. I would have assumed that the scaling would carry over in the computation graph when the gradients are calculated, and hence it would not be necessary to apply the scaling again.
Thank you !
All backprop does is take the derivatives of the functions in forward prop and apply them. If the forward function has a factor of 1/keep_prob, then its derivative also has the same factor, right? If we have a function:
g(z) = a * f(z)
where a is a constant, then we have:
g'(z) = a * f'(z)
That’s just basic calculus and of course it applies even in the multivariate and vector case.
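In code, this amounts to reusing the same mask and the same division in the backward step. A sketch matching the hypothetical `dropout_forward` above, where `dA` is the gradient flowing back into that layer:

```python
def dropout_backward(dA, D, keep_prob):
    """Backward pass of inverted dropout.

    The forward pass computed A * D / keep_prob, so by the chain rule
    the gradient picks up the same factor: dA * D / keep_prob.
    """
    return dA * D / keep_prob
```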