I have been trying to wrap my head around this for a few days and can’t seem to grasp it thoroughly.
According to the lecture notes for C2W1, the activations that remain after the dropout mask has been applied have to be divided by keep_prob, i.e. if keep_prob is 0.5 and I have 4 units left, I double the values for all of them.
This is also mentioned in the programming exercise for Regularization:
During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5.
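Just so we are looking at the same thing, here is a minimal numpy sketch of the inverted dropout step as I understand it from the exercise (the function and variable names are my own, not the assignment's):

```python
import numpy as np

np.random.seed(1)

def dropout_forward(a, keep_prob):
    """Inverted dropout: zero out units, then rescale the survivors.

    `a` is an activation matrix and `keep_prob` the probability that a
    unit is kept (names chosen for illustration only).
    """
    mask = np.random.rand(*a.shape) < keep_prob  # True with prob keep_prob
    a = a * mask          # shut down roughly (1 - keep_prob) of the units
    a = a / keep_prob     # the rescaling step the exercise describes
    return a, mask

a = np.ones((4, 10000))
a_drop, _ = dropout_forward(a, keep_prob=0.5)
print(a_drop.mean())  # prints something close to 1.0, the original mean
```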
According to Wikipedia, the expected value of a random variable is, intuitively, its arithmetic mean…
Going with that definition, the expected value of 6 numbers is:
E = (a1 + a2 + a3 + a4 + a5 + a6) / 6
If I 'drop out' 50% of them (say a1 = a2 = a3 = 0 after the mask) and scale the remaining numbers by 2, I end up with:
E' = (2a4 + 2a5 + 2a6) / 6
E' = (a4 + a5 + a6) / 3
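To make my confusion concrete, here is the same arithmetic in numpy (the values for a1..a6 are made up for illustration):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # a1..a6, made-up values
E = a.mean()                                   # (a1 + ... + a6) / 6 = 3.5

mask = np.array([0, 0, 0, 1, 1, 1])            # the draw above: a1..a3 dropped
a_drop = a * mask / 0.5                        # the inverted-dropout rescaling
E_prime = a_drop.mean()                        # (a4 + a5 + a6) / 3 = 5.0

print(E, E_prime)                              # 3.5 5.0 -- not equal for this draw
```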
How is E' equal to E? What am I misunderstanding here?