Question about the dropout process

Hi, say keep_prob = 0.5. For a hidden layer with 5 hidden units, the chance that all the hidden units are shut down is (0.5)^5 = 0.03125, so in 100 iterations it will happen about 3 times. Is the whole hidden layer shut down then? How does the algorithm deal with this?

Hello @1492r,

If you have 100 samples and 5 features each, then the chance for all of them to "shut down" is 0.5^(100×5). The sample size was left out of the calculation.
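If it helps, here is a quick back-of-envelope check of the two probabilities being discussed (just redoing the arithmetic from this thread, with the 5-unit, 100-sample numbers):

```python
# Illustrative arithmetic only, using the numbers from this thread.
keep_prob = 0.5
n_units = 5
n_samples = 100

# Chance that ONE sample has all 5 hidden units dropped in one iteration:
p_one_sample = (1 - keep_prob) ** n_units
print(p_one_sample)  # 0.03125

# Chance that every unit is dropped for EVERY sample at once
# (each sample gets its own mask, so the exponents multiply):
p_all_samples = (1 - keep_prob) ** (n_units * n_samples)
print(p_all_samples)  # astronomically small
```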


Thanks, Raymond for your reply!
First I understand that the keep_prob is for neurons, not samples. In the programming assignment, it says:
"At each iteration, you shut down (= set to zero) each neuron of a layer with probability 1 − keep_prob or keep it with probability keep_prob (50% here). The dropped neurons don't contribute to the training in both the forward and backward propagations of the iteration."

However, if I understand correctly, the probability you calculated is the probability that the hidden layer is shut down for all the samples at once, right?
But what happens if the hidden layer is shut down for just one sample? Does that occur with probability (0.5)^5? If it does, does the algorithm just ignore that sample?

Maybe the algorithm works differently. In your scenario, the algorithm is not controlling for the case of all units shutting off. I don't know the details of the implementation, but if I were to code it, I would probably do:

  1. Calculate the total number of units to shut down. With keep_prob = 0.5 and 5 units, that could be 2 or 3.

  2. Loop from 1 to qtyUnitsToShutOff:

  • Pick one random unit from the still-alive units
  • Shut it off

With this hypothetical algorithm, your case would never happen.
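To make the idea concrete, here is a minimal NumPy sketch of this hypothetical fixed-count scheme (`fixed_count_dropout_mask` is a made-up name, and this is not how the course's dropout actually works; with 5 units and keep_prob = 0.5, Python's rounding picks 2 units to shut off):

```python
import numpy as np

def fixed_count_dropout_mask(n_units, keep_prob, rng=np.random.default_rng(0)):
    """Hypothetical scheme: always shut off a fixed number of units,
    chosen at random without replacement, so all-zeros can never happen
    (as long as keep_prob > 0)."""
    n_off = int(round(n_units * (1 - keep_prob)))
    mask = np.ones(n_units)
    off_idx = rng.choice(n_units, size=n_off, replace=False)  # pick distinct units
    mask[off_idx] = 0.0
    return mask

mask = fixed_count_dropout_mask(5, 0.5)
print(mask)  # exactly 2 zeros among the 5 entries, never all zeros
```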

What do you think?

Hello @1492r,

I think this note that you quoted from the lab and Juan's example share the same idea: on average, a neuron has a 1 − keep_prob chance of being turned off. This statement holds whether we look at it from the perspective of the whole dataset or from just one sample.

However, I also want to draw our attention to this slide:

Obviously, the np.random.rand function generates one random number per feature (a3.shape[1]) and per sample (a3.shape[0]). This is why each sample can see different neurons being turned off, and that explains my earlier 0.5^(100×5).
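For reference, the inverted-dropout step from that slide can be sketched roughly like this (the `a3`/`keep_prob` names follow the lab's convention; the 100×5 shape is my assumption from the example in this thread):

```python
import numpy as np

np.random.seed(1)
keep_prob = 0.5
a3 = np.random.randn(100, 5)   # 100 samples, 5 hidden units (assumed shapes)

# One coin flip per sample AND per unit, hence the (100, 5) mask:
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
a3 = a3 * d3          # zero out the dropped units
a3 = a3 / keep_prob   # inverted dropout: rescale the survivors

# Each ROW has its own mask, so different samples drop different units:
print(d3[0])
print(d3[1])
```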

As you said, there is a non-zero chance that a sample ends up with all of its neurons' outputs being zero. However, that does not mean the algorithm ignores that sample; the sample's activations just become all zeros rather than the sample disappearing.

You can say that all zeros is bad because it can cause a large error when computing that sample's loss. You can also say it is bad because the model will learn almost nothing from that sample.

Yes, it is bad, but if it is just one sample out of many, then the problem is relatively small. This is also why we don't casually set a very low keep_prob (imagine, you could indeed set it to 0.00001). Your 0.5 is not exceptionally low (though I wouldn't use 0.5 in the case of just 5 neurons), and 0.5^(100×5) is not very high either. Do you get the idea?


If the major implementations (TensorFlow, Keras, for instance) didn't account for this case, then in the event that all neurons are shut off, wouldn't the rescaling (inverted dropout >> dividing by keep_prob) still leave zeros? And in such a case, the only survivor would be the 'b' bias, which in some cases is not even used… interesting. Thoughts?

Looking at the source code of TensorFlow, there's a comment in the dropout implementation (line 5500) that says:

ValueError: If `rate` is not in `[0, 1)` or if `x` is not a floating point
  tensor. `rate=1` is disallowed, because the output would be all zeros,
  which is likely not what was intended.

So it seems the TensorFlow developers thought about it.

However, we can still set the rate to 0.9999, which makes essentially all outputs zero. I think the responsibility for choosing a good rate is on us.
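To illustrate the point, here is a tiny NumPy simulation (not TensorFlow itself) of what such an extreme rate does to a layer's activations:

```python
import numpy as np

np.random.seed(0)
keep_prob = 0.0001             # analogous to rate = 0.9999
a = np.random.randn(100, 5)    # 100 samples, 5 units (made-up shapes)

d = np.random.rand(*a.shape) < keep_prob
a = a * d / keep_prob          # inverted dropout with an extreme rate

# With keep_prob this small, essentially every activation is zeroed:
print(np.count_nonzero(a), "of", a.size, "activations survive")
```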

Indeed. That's why I said "the model almost won't learn anything from that sample." In that case, only some biases can be updated; the weights can't.

I think the case we are discussing here (5 neurons, keep_prob = 0.5) is not really close to a realistic use case. We probably wouldn't use dropout if there were just 5 neurons.


Thank you, Raymond and Juan, for the quick reply and clear explanation!
Sorry for the somewhat late reply on my side, as weekdays are usually not a good time for Coursera…
I understand it now. But would you mind explaining why "the only survivor would be the 'b' bias"? Once all neurons are shut down, wouldn't we simply get 0s as we multiply the "a"s by 0s?
Thanks again, guys!

Hello @1492r,

No problem at all!

At least we will have some surviving bias terms in the output layer. Since it is the last layer, nobody multiplies them by any zero.
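A tiny sketch of that point (shapes are made up for illustration): even if the hidden activations feeding the output layer are all zeros, z = Wa + b reduces to b, so the biases still pass through.

```python
import numpy as np

np.random.seed(2)
W = np.random.randn(3, 5)   # output layer weights (3 outputs, 5 hidden units)
b = np.random.randn(3, 1)   # output layer biases
a = np.zeros((5, 1))        # hidden activations after an all-dropped iteration

z = W @ a + b               # the zeros wipe out W's contribution entirely
print(np.allclose(z, b))    # True: only the bias terms survive
```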


I see. Thanks, Raymond!