During the week 1 lecture series, where Professor Andrew introduces regularization, he talks about using “Dropout” as one such technique. While implementing dropout at any layer “l”, we first compute a dropout matrix for that layer and then multiply it element-wise with A[l] (the activations of that layer) to zero out the effect of certain neurons. Once this is done, Professor Andrew also mentions the importance of rescaling the A[l] matrix by dividing it element-wise by the “keep-probs” variable. I would really appreciate it if someone could elaborate on the rescaling procedure and explain how dividing by the “keep-probs” variable helps.

Thank you.

1 Like

Hi, @jaylodha.

By dropping some units during training, you change the expected values of the activations relative to test time, when no units are dropped. To compensate for this, you can either scale down the weights at test time (multiplying by `keep_probs`), or scale up the activations at training time (dividing by `keep_probs`).
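To make the second option concrete, here is a minimal NumPy sketch of scaling up the activations at training time. The function name `inverted_dropout`, the toy all-ones activations, and the fixed seed are assumptions made for illustration; they are not taken from the course code.

```python
import numpy as np

def inverted_dropout(a, keep_probs, rng):
    """Apply dropout to activations `a`, then rescale by 1 / keep_probs."""
    # Each unit is kept independently with probability keep_probs.
    d = rng.random(a.shape) < keep_probs
    a_dropped = a * d                  # zero out the dropped units
    a_scaled = a_dropped / keep_probs  # scale up the surviving units
    return a_dropped, a_scaled

rng = np.random.default_rng(0)   # fixed seed, only for reproducibility
a = np.ones((1000, 1000))        # toy activations, all equal to 1
keep_probs = 0.8

a_dropped, a_scaled = inverted_dropout(a, keep_probs, rng)
print(a.mean())          # mean activation with no dropout (test time)
print(a_dropped.mean())  # close to keep_probs: the mean has shrunk
print(a_scaled.mean())   # close to the no-dropout mean again
```

Without the division, the next layer would see inputs that are on average `keep_probs` times smaller during training than at test time. With it, the averages match, so nothing needs to change at test time.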

You have all the details here.

Happy learning!

4 Likes

To be more specific, at training time you’re multiplying the activations by a vector of independent Bernoulli random variables whose expected value is precisely `keep_probs`, so you divide by `keep_probs` to compensate for this.
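Written out for a single unit, the argument looks roughly like this. The symbols are my own shorthand, not notation from the lecture: $a_i$ is the activation of unit $i$, $d_i$ is its dropout mask entry, and $p$ stands for `keep_probs`.

$$
d_i \sim \mathrm{Bernoulli}(p), \qquad \mathbb{E}[d_i] = p
$$

$$
\mathbb{E}[d_i \, a_i] = p \, a_i, \qquad \mathbb{E}\!\left[\frac{d_i \, a_i}{p}\right] = \frac{p \, a_i}{p} = a_i
$$

So without rescaling, the expected activation shrinks by a factor of $p$, and dividing by $p$ brings it back to the value the unit has when nothing is dropped.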

Let me know if that helped.

2 Likes

Hey, thanks a lot for the explanation. It was highly insightful.

1 Like