Question about the dropout process

Hi, say keep_prob = 0.5. For a hidden layer with 5 hidden units, the chance that all the hidden units are shut down is (0.5)^5 = 0.03125, so in 100 iterations it will happen about 3 times. Is the whole hidden layer shut down then? How does the algorithm deal with this?

Hello @1492r,

If you have 100 samples and 5 features each, then the chance for all of them to "shut down" is 0.5^(100×5). The sample size was left out of the calculation.
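If it helps, here is a quick back-of-envelope check of the two probabilities being discussed (just redoing the arithmetic from this thread, with the 5-unit, 100-sample numbers):

```python
# Illustrative arithmetic only, using the numbers from this thread.
keep_prob = 0.5
n_units = 5
n_samples = 100

# Chance that ONE sample has all 5 hidden units dropped in one iteration:
p_one_sample = (1 - keep_prob) ** n_units
print(p_one_sample)  # 0.03125

# Chance that every unit is dropped for EVERY sample at once
# (each sample gets its own mask, so the exponents multiply):
p_all_samples = (1 - keep_prob) ** (n_units * n_samples)
print(p_all_samples)  # astronomically small
```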


Thanks, Raymond for your reply!
First I understand that the keep_prob is for neurons, not samples. In the programming assignment, it says:
"At each iteration, you shut down (= set to zero) each neuron of a layer with probability 1 − keep_prob or keep it with probability keep_prob (50% here). The dropped neurons don't contribute to the training in both the forward and backward propagations of the iteration."

However, if I understand correctly, the probability you calculated is the probability that the hidden layer is shut down for all the samples at once, right?
But what happens if the hidden layer is shut down for just one sample? Does that occur with probability (0.5)^5? If it does, does the algorithm just ignore that sample?

Maybe the algorithm works differently. In your scenario, the algorithm is not controlling for the case of all units shutting off. I don't know the details of the implementation, but if I were to code it, I would probably do:

  1. Calculate the total number of units to shut down. With keep_prob = 0.5 and 5 units, that could be 2 or 3.

  2. Loop from 1 to qtyUnitsToShutOff:

  • Pick one random unit from the still-alive units
  • Shut it off

With this hypothetical algorithm, your case would never happen.
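To make the idea concrete, here is a minimal NumPy sketch of this hypothetical fixed-count scheme (`fixed_count_dropout_mask` is a made-up name, and this is not how the course's dropout actually works; with 5 units and keep_prob = 0.5, Python's rounding picks 2 units to shut off):

```python
import numpy as np

def fixed_count_dropout_mask(n_units, keep_prob, rng=np.random.default_rng(0)):
    """Hypothetical scheme: always shut off a fixed number of units,
    chosen at random without replacement, so all-zeros can never happen
    (as long as keep_prob > 0)."""
    n_off = int(round(n_units * (1 - keep_prob)))
    mask = np.ones(n_units)
    off_idx = rng.choice(n_units, size=n_off, replace=False)  # pick distinct units
    mask[off_idx] = 0.0
    return mask

mask = fixed_count_dropout_mask(5, 0.5)
print(mask)  # exactly 2 zeros among the 5 entries, never all zeros
```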

What do you think?

Hello @1492r,

I think this note that you quoted from the lab and Juan's example share the same idea: on average, a neuron has a 1 − keep_prob chance of being turned off. This statement holds whether we look at it from the perspective of the whole dataset or from just one sample.

However, I also want to draw our attention to this slide:

Obviously, the np.random.rand function generates one random number per feature (a3.shape[1]) and per sample (a3.shape[0]). This is why each sample can see different neurons being turned off, and that explains my earlier 0.5^(100×5).
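For reference, the inverted-dropout step from that slide can be sketched roughly like this (the `a3`/`keep_prob` names follow the lab's convention; the 100×5 shape is my assumption from the example in this thread):

```python
import numpy as np

np.random.seed(1)
keep_prob = 0.5
a3 = np.random.randn(100, 5)   # 100 samples, 5 hidden units (assumed shapes)

# One coin flip per sample AND per unit, hence the (100, 5) mask:
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
a3 = a3 * d3          # zero out the dropped units
a3 = a3 / keep_prob   # inverted dropout: rescale the survivors

# Each ROW has its own mask, so different samples drop different units:
print(d3[0])
print(d3[1])
```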

As you said, there is a non-zero chance that a sample ends up with all of its neurons' outputs being zero. However, that does not mean the algorithm ignores that sample; the sample's activations just become all zeros rather than the sample disappearing.

You can say that all zeros is bad because it can cause a large error when computing that sample's loss. You can also say it is bad because the model will learn almost nothing from that sample.

Yes, it is bad, but if it is just one sample out of many, then the problem is relatively small. This is also why we don't casually set a very low keep_prob (imagine, you could indeed set it to 0.00001). Your 0.5 is not exceptionally low (though I wouldn't use 0.5 in the case of just 5 neurons), and 0.5^(100×5) is not very high either. Do you get the idea?


If the major implementations (TensorFlow, Keras, for instance) didn't account for this case, then in the event that all neurons are shut off, wouldn't the rescaling (inverted dropout >> dividing by keep_prob) still leave zeros? And in such a case, the only survivor would be the 'b' bias, which in some cases is not even used… interesting. Thoughts?

Looking at the source code of TensorFlow, there's a comment in the dropout implementation (line 5500) that says:

ValueError: If `rate` is not in `[0, 1)` or if `x` is not a floating point
  tensor. `rate=1` is disallowed, because the output would be all zeros,
  which is likely not what was intended.

So it seems the TensorFlow developers thought about it.

However, we can still set the rate to 0.9999, which makes essentially all outputs zero. I think the responsibility for choosing a good rate is on us.
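To illustrate the point, here is a tiny NumPy simulation (not TensorFlow itself) of what such an extreme rate does to a layer's activations:

```python
import numpy as np

np.random.seed(0)
keep_prob = 0.0001             # analogous to rate = 0.9999
a = np.random.randn(100, 5)    # 100 samples, 5 units (made-up shapes)

d = np.random.rand(*a.shape) < keep_prob
a = a * d / keep_prob          # inverted dropout with an extreme rate

# With keep_prob this small, essentially every activation is zeroed:
print(np.count_nonzero(a), "of", a.size, "activations survive")
```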

Indeed. That's why I said "the model almost won't learn anything from that sample." In that case, only some biases can be updated; the weights can't.

I think the case we are discussing here (5 neurons, keep_prob = 0.5) is not really close to a realistic use case. We probably wouldn't use dropout if there were just 5 neurons.


Thank you, Raymond and Juan, for the quick reply and clear explanation!
Sorry for the somewhat late reply on my side, as weekdays are usually not a good time for Coursera…
I understand it now. But would you mind explaining why "the only survivor would be the 'b' bias"? Once all neurons are shut down, wouldn't we simply get 0s as we multiply the "a"s by 0s?
Thanks again, guys!

Hello @1492r,

No problem at all!

At least we will have some surviving bias terms in the output layer. Since it is the last layer, nobody multiplies them by any zero.
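A tiny sketch of that point (shapes are made up for illustration): even if the hidden activations feeding the output layer are all zeros, z = Wa + b reduces to b, so the biases still pass through.

```python
import numpy as np

np.random.seed(2)
W = np.random.randn(3, 5)   # output layer weights (3 outputs, 5 hidden units)
b = np.random.randn(3, 1)   # output layer biases
a = np.zeros((5, 1))        # hidden activations after an all-dropped iteration

z = W @ a + b               # the zeros wipe out W's contribution entirely
print(np.allclose(z, b))    # True: only the bias terms survive
```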


I see. Thanks, Raymond!