During the Week 1 lecture series where Professor Andrew introduces the concept of regularization, he discusses "Dropout" as one such technique. When implementing dropout at any layer l, we first compute a dropout matrix for that layer and multiply it with A[l] (the activations of that layer) to zero out the contribution of certain neurons. Once this is done, Professor Andrew also mentions the significance of rescaling the A[l] matrix by performing element-wise division with the keep_prob variable. I would really appreciate it if someone could elaborate on the rescaling procedure and explain how dividing by keep_prob helps.
Thank you.
Hi, @jaylodha.
By dropping some units during training you are changing the expected values of the activations with respect to test time. To compensate for this, you can either scale down the weights at test time (multiplying by keep_prob), or scale up the activations at training time (dividing by keep_prob).
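If it helps to see it concretely, here is a minimal NumPy sketch of the "inverted dropout" idea (the shapes and variable names are just for illustration, not the assignment's exact code):

```python
import numpy as np

np.random.seed(1)
keep_prob = 0.8                               # probability of keeping a unit

A = np.random.randn(4, 5)                     # A[l]: activations, shape (units, examples)

# Training time (inverted dropout):
D = np.random.rand(*A.shape) < keep_prob      # Bernoulli(keep_prob) mask, ~80% ones
A_train = (A * D) / keep_prob                 # zero out dropped units, then scale up

# Test time: no mask and no rescaling; the full A is used as-is,
# because the division by keep_prob already compensated during training.
```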
You have all the details here.
Happy learning!
To be more specific, at training time you're multiplying the activations by a vector of independent Bernoulli random variables whose expected value is precisely keep_prob, so you divide by keep_prob to compensate for this.
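As a toy check (my own snippet, not from the course), you can verify numerically that masking and then dividing by keep_prob leaves the expected value of an activation unchanged:

```python
import numpy as np

np.random.seed(0)
keep_prob = 0.8
a = 2.0                                  # a single activation value

# Sample many Bernoulli(keep_prob) masks and average the rescaled activation.
D = np.random.rand(1_000_000) < keep_prob
print(np.mean(a * D / keep_prob))        # ~2.0, i.e. E[a * D / keep_prob] = a
```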
Let me know if that helped.
Hey, thanks a lot for the explanation. It was highly insightful.