Hi, say keep_prob = 0.5. For a hidden layer with 5 hidden units, the chance that all the hidden units are shut down is (0.5)^5 = 0.03125. In 100 iterations it will happen ~3 times. Is the whole hidden layer shut down then? How does the algorithm deal with this?
Thanks!
Hello @1492r,
If you have 100 samples with 5 features each, then the chance that all of them "shut down" is 0.5^{100 \times 5}. Your calculation left out the sample size.
Raymond
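The two probabilities in this exchange can be checked with a few lines of arithmetic. This is only a sketch, assuming each unit is dropped independently for each sample; the variable names and the batch size of 100 are taken from the example above and are not part of any library.

```python
keep_prob = 0.5      # probability of keeping a unit
n_units = 5          # hidden units in the layer
n_samples = 100      # sample count used in the reply above

# Probability that every unit is dropped for ONE given sample.
p_one_sample = (1 - keep_prob) ** n_units

# Probability that every unit is dropped for ALL samples at once.
p_all_samples = (1 - keep_prob) ** (n_units * n_samples)

# Expected number of samples whose hidden layer is fully zeroed.
expected_zeroed = n_samples * p_one_sample

print(p_one_sample)     # 0.03125
print(p_all_samples)    # about 3.05e-151
print(expected_zeroed)  # 3.125
```

So the (0.5)^5 figure applies per sample, while 0.5^{100 \times 5} is the chance that the layer is zeroed for the entire batch.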
Thanks, Raymond for your reply!
First I understand that the keep_prob is for neurons, not samples. In the programming assignment, it says:
"At each iteration, you shut down (= set to zero) each neuron of a layer with probability 1 - keep_prob or keep it with probability keep_prob (50% here). The dropped neurons don't contribute to the training in both the forward and backward propagations of the iteration."
However, I assume that the probability you calculated is the probability that the hidden layer is shut down for all the samples, right?
But what happens if one hidden layer is shut down for just one sample? Does it occur at (0.5)^5? If it does, does the algorithm just ignore that sample?
Maybe the algorithm works differently. In your scenario, the algorithm does nothing to prevent all units from shutting off. I don't know the details of the implementation, but if I were to code it, I would probably do:
- Calculate the total number of units to shut down. With keep_prob = 0.5 and 5 units, that could be 2 or 3.
- Loop from 1 to qtyUnitsToShutOff:
  - Get one random unit from all the units still alive.
  - Shut it off.
With this hypothetical algorithm, your case would never happen.
What do you think?
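The hypothetical fixed-count scheme described in this post could be sketched as follows. This is not how standard dropout works (standard dropout drops each unit independently); `drop_exact_count` is an illustrative name, and the (units, samples) layout is an assumption.

```python
import numpy as np

def drop_exact_count(a, keep_prob, rng):
    """Zero out a fixed number of units per sample (hypothetical variant).

    `a` is assumed to have shape (n_units, n_samples).
    """
    n_units, n_samples = a.shape
    # Python rounds 2.5 to 2, so 2 of 5 units are dropped at keep_prob = 0.5.
    n_drop = int(round((1 - keep_prob) * n_units))
    mask = np.ones_like(a)
    for j in range(n_samples):
        # Choose n_drop distinct units to shut off for this sample.
        dropped = rng.choice(n_units, size=n_drop, replace=False)
        mask[dropped, j] = 0.0
    return a * mask, mask

rng = np.random.default_rng(0)
a = np.ones((5, 100))
out, mask = drop_exact_count(a, keep_prob=0.5, rng=rng)
print(mask.sum(axis=0).min())  # 3.0: every sample keeps exactly 3 units
```

With a fixed count, no sample can lose all of its units, but the kept fraction (3/5 here) no longer equals keep_prob exactly, so any rescaling would have to account for that.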
Hello @1492r,
I think the note you quoted from the lab and Juan's example share the same idea: on average, a neuron has a 1-keep_prob chance of being turned off. This statement holds whether you look at the whole data set or at just one sample.
However, I also want to draw our attention to this slide:
The np.random.rand function generates one random number for every entry of a3, that is, for every neuron of every sample (the mask has shape (a3.shape[0], a3.shape[1])). This is why each sample can see different neurons being turned off, and that explains my earlier 0.5^{100\times5}.
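A minimal sketch of the masking step described here, assuming a3 is laid out with 5 neurons as rows and 100 samples as columns (the sizes and the seed are illustrative, not taken from the slide):

```python
import numpy as np

np.random.seed(0)                 # only for reproducibility
keep_prob = 0.5
a3 = np.random.rand(5, 100)       # assumed layout: 5 neurons (rows) x 100 samples (columns)

# One independent random number per entry, so each sample gets its own mask.
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
a3 = a3 * d3                      # shut down the dropped neurons
a3 = a3 / keep_prob               # inverted dropout: rescale the survivors

# Samples (columns) in which every neuron happened to be dropped.
fully_dropped = ~d3.any(axis=0)
print(d3.shape, int(fully_dropped.sum()))
```

Any column flagged in `fully_dropped` stays in the batch; its activations are simply all zeros, and rescaling by 1 / keep_prob leaves them at zero.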
As you said, there is a non-zero chance that a sample ends up with all of its neurons' outputs being zero. However, that does not mean the algorithm ignores that sample: its activations just become all zeros; the sample does not disappear.
You can say that all zeros is bad because it can cause a large error when the loss is computed at the end. You can also say it is bad because the model will learn almost nothing from that sample.
Yes, it is bad, but if it is just one sample out of many, then the problem is relatively small. This is also why we don't casually set a very low keep_prob (imagine: you could indeed set it to 0.00001). Your 0.5 is not exceptionally low (though I wouldn't use 0.5 with just 5 neurons), and 0.5^{100\times5} is not very high either. Do you get the idea?
Cheers,
Raymond
If the major implementations (TensorFlow, Keras, for instance) don't account for this case, then in the event that all neurons are shut off, wouldn't the rescaling (inverted dropout: 1 / keep_prob, i.e. 1 / (1 - rate) in TensorFlow's terms) still lead to zeros? In that case, the only survivor would be the bias "b", which in some cases is not even used... Interesting. Thoughts?
Looking at the TensorFlow source code, there's a comment in the dropout implementation (line 5500) that says:
ValueError: If `rate` is not in `[0, 1)` or if `x` is not a floating point
tensor. `rate=1` is disallowed, because the output would be all zeros,
which is likely not what was intended.
So it seems the Tensorflow developers thought about it.
However, we can still set the rate to 0.9999, which makes almost all outputs zero. I think the responsibility for choosing a good rate is on us.
Indeed. That's why I said "the model almost won't learn anything from that sample." In that case, only some biases can be updated, but the weights can't.
I think the case we are discussing here (5 neurons, keep_prob=0.5) is not really close to an actual use case. We probably wouldn't use dropout with just 5 neurons.
Thank you, Raymond and Huan, for the quick reply and clear explanation!
Sorry for the slightly late reply on my side, as weekdays usually are not a good time for Coursera...
I understand it now. But would you mind explaining why "the only survivor would be the 'b' bias"? Once all neurons are shut down, wouldn't we simply get 0s, since we multiply the "a"s by 0s?
Thanks again, guys!
Hello @1492r,
No problem at all!
At least we will have some surviving bias terms in the output layer. It is the last layer, so nothing multiplies them by zero.
Raymond
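A small numeric sketch of the surviving-bias point, assuming a single output unit fed by the 5 dropped hidden units. The names `W_out` and `b_out` and all values are hypothetical, and the upstream gradient is a placeholder rather than a real loss gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
a_hidden = np.zeros((5, 1))          # one sample whose 5 hidden units were all dropped
W_out = rng.normal(size=(1, 5))      # hypothetical output-layer weights
b_out = np.array([[0.3]])            # hypothetical output-layer bias

z_out = W_out @ a_hidden + b_out     # forward pass: the weights contribute nothing
dz_out = np.array([[1.0]])           # placeholder upstream gradient

dW_out = dz_out @ a_hidden.T         # all zeros: no weight update from this sample
db_out = dz_out.sum(axis=1, keepdims=True)  # non-zero: the bias can still be updated

print(z_out, dW_out, db_out)
```

The output-layer pre-activation equals the bias alone, the weight gradient is zero, and the bias gradient is not, which matches the remark that only some biases can be updated for such a sample.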
I see. Thanks, Raymond!