Hi @S.hejazinezhad ,
Let’s try to understand a3 /= keep-prob together. I’ll use my own words to help you build some intuition for it:
We know that with Dropout we shut down some nodes in the layer. For example, in a 50-unit layer with a keep-prob of 80% (that is, a dropout rate of 20%), we shut down about 10 units on average and keep about 40. It is important to understand here that we don’t physically remove the shut-down units; instead, we set their activations to zero.
Next, we know that once Dropout has been applied, we calculate Z = W * a + b for the next layer. In the case of a3, we would be calculating Z4 = W4 * a3 + b4, right?
Remember that we have shut off roughly 10 units of a3, so their contribution to Z4 is zero.
And here comes a key question:
What will happen to Z4 if we use a3 with just “80% of its power”? The W4 * a3 term will, on average, shrink to about 80% of what it would be without Dropout, so the expected value of Z4 is reduced, right?
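To put that in symbols, here is a small sketch of the argument. The notation is my own and not from the course: d is the random 0/1 mask with one entry per unit of a3, p stands for keep-prob, and I am assuming the mask is drawn independently of a3, which is treated as fixed.

$$
\mathbb{E}[d_i \, a_{3,i}] = p \, a_{3,i}
\quad\Longrightarrow\quad
\mathbb{E}[Z_4] = W_4 \, \mathbb{E}[d \odot a_3] + b_4 = p \, W_4 a_3 + b_4
$$

With p = 0.8, the W4 * a3 part of Z4 is expected to be only 80% of its value without Dropout, while b4 is unchanged.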
How can we solve this?
Well, if we divide a3 by keep-prob, meaning, a3 /= keep-prob, then that will “give more power” to the 80% of active units, right?
Think about this: what happens when you divide 1 by 0.8? 1/0.8 = 1.25 … it is bumped up!
So when we do a3 /= keep-prob, or in the example, a3 /= 0.8, we are basically bumping up all the active units to ‘compensate’ for the missing units.
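If it helps to see the numbers, here is a minimal NumPy sketch of the idea. It is only an illustration under my own assumptions, not the assignment’s code: the function name `inverted_dropout`, the all-ones a3, and the 50 x 10000 shape are made up for the demo, and `keep_prob` plays the role of keep-prob above.

```python
import numpy as np


def inverted_dropout(a, keep_prob, rng):
    """Apply a random dropout mask to `a`, then rescale by 1 / keep_prob.

    Returns both the masked activations (before scaling) and the
    rescaled ones, so the two can be compared.
    """
    # Each unit is kept with probability keep_prob, otherwise zeroed.
    mask = rng.random(a.shape) < keep_prob
    a_dropped = a * mask              # shut-down units become 0
    a_scaled = a_dropped / keep_prob  # the a3 /= keep-prob step
    return a_dropped, a_scaled


rng = np.random.default_rng(0)   # fixed seed, only for repeatability
keep_prob = 0.8

# Hypothetical a3: 50 units, 10000 examples, every activation equal to 1
# so that the averages are easy to read.
a3 = np.ones((50, 10000))

a_dropped, a_scaled = inverted_dropout(a3, keep_prob, rng)

print("mean of a3 without dropout:  ", a3.mean())
print("mean after masking only:     ", a_dropped.mean())  # roughly 0.8
print("mean after dividing by 0.8:  ", a_scaled.mean())   # roughly 1.0
print("value of each surviving unit:", a_scaled.max())    # 1 / 0.8
```

Masking alone pulls the average activation down to roughly keep_prob times the original, and dividing by keep_prob brings it back to roughly the original level, which is the “compensation” described above.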
I hope this explanation gives you some intuition on the reason why we apply ‘inverted dropout’.
Juan