Why do you divide the activations by keep_prob when you use dropout?

I know this topic has already been discussed several times, but I still don’t understand it, so let me ask as a new topic.

Is the point of scaling the activation units by 1/keep_prob to make the sum of the activation values that are retained after dropout the same as the sum of all the activation values when dropout is not applied?

If so, how?
I ask because keep_prob is just a probability, and I don’t understand what it has to do with correcting the sum of the activation units.

Thank you in advance.

If an activation layer were something like [[2, 4, 4, 2]] and keep_prob = 0.5,
the sum of the activation values before dropout is 12.
After dropout, the activation layer is [[0, 4, 4, 0]], and the sum is 8.
Dividing that sum by 0.5 gives 16.

So it does not come out the same as the sum of the activation values before dropout.

I’m not quite sure which value you want to keep the same by scaling the retained activation units by 1/keep_prob.

Hello @ricky_pii,

Thank you for laying out your question so clearly.

The key here is that we do not intend to keep the value exactly the same, and as your example shows, keeping it exactly the same is not really possible.

Therefore, the idea is that we hope to make them roughly similar (as in your example, 12 and 16 are reasonably close, aren’t they?). We want to keep them similar so that the activation values are not off by too much between training time and prediction time.

Let’s reuse your example. At training time, dropout is there, and whichever two numbers we keep, their sum will always be < 10. However, at prediction time, when dropout is not there, the sum will be > 10.

To the model, < 10 is normal because that is what it was trained on, but at prediction time, when dropout is lifted, the behavior becomes quite different, and such a behavioral shift can harm the model’s performance. Therefore, we want to avoid that shift by using the 1/keep_prob factor to roughly scale things back - just roughly - and that is a simple and efficient way to do it.
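
To make that concrete, here is a minimal sketch of inverted dropout at training time versus prediction time, using your example numbers (just an illustration, not the assignment code):

import numpy as np

np.random.seed(1)
A = np.array([[2.], [4.], [4.], [2.]])   # the example activations
keep_prob = 0.5

# Training time: zero out some units, then scale the survivors by 1/keep_prob
D = (np.random.rand(*A.shape) < keep_prob).astype(float)
A_train = (A * D) / keep_prob

# Prediction time: no dropout and no scaling at all
A_pred = A

# The two sums come out roughly similar (not identical), so the model sees
# activations on a comparable scale at training and at prediction time
print(np.sum(A_train), np.sum(A_pred))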

Cheers,
Raymond


Hi, Ricky.

It looks like you already found this thread, which has a pretty complete discussion of this point. Did you also read the later replies on that thread? E.g. this one, this one and this one?

Dear @rmwkwok ,

Thank you for your reply.
All the explanations I’ve seen so far say that scaling by 1/keep_prob is meant to make the sum of the activation values after dropout “the same” as the sum without dropout.
Thank you for noticing that the word “roughly” was all I needed and for making it explicit in your explanation.

Do you consider a sum after dropout that has the same number of digits as the sum without dropout to be “roughly the same”?
Can you always achieve that kind of sum by scaling by 1/keep_prob?
Or should I ignore the details and simply remember that I have to scale by 1/keep_prob to make the sum after dropout somewhat larger?

Hello @ricky_pii,

Ha ha! You probably already know that we can’t. We can’t rely on 1/keep_prob alone to get back the same value, and we can’t rely on 1/keep_prob alone to guarantee any such rule about how rough is rough.

What we also need is a large layer size. You see, dropout is a regularization technique, and we apply regularization when the model is vulnerable to overfitting, which usually happens when the neural network is too large. The more usual scenario where dropout is applied is when our layers have a considerable number of units. Your example [2, 4, 4, 2] has only 4 units, which is not large at all. It is a good example to illustrate your question, but it is not a good example of a neural network that is vulnerable to overfitting.

If we consider a layer that has 2000 units, a reasonable keep_prob like 0.5 will leave us about 1000 activation values to sum. With such a large number of values, there is a very good chance that applying 1/keep_prob will get us pretty close to the result without dropout. You might experiment with this using a numpy random generator. Actually, if you really experiment, try not just 2000, but also 1000, 200, 100, 20, 10, and 4. You can see how the variation changes with the layer size.
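
For instance, a quick-and-dirty version of that experiment could look something like this (just a sketch with made-up positive activations; the exact numbers will vary with the random seed):

import numpy as np

np.random.seed(0)
keep_prob = 0.5

for n in [4, 10, 20, 100, 200, 1000, 2000]:
    errors = []
    for _ in range(1000):
        a = np.random.rand(n)                      # made-up positive activations
        d = (np.random.rand(n) < keep_prob).astype(float)
        sum_no_dropout = np.sum(a)
        sum_scaled = np.sum(a * d) / keep_prob     # dropout + 1/keep_prob scaling
        errors.append(abs(sum_scaled - sum_no_dropout) / sum_no_dropout)
    print(f"n = {n:4d}: average relative error of the scaled sum = {np.mean(errors):.3f}")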

Therefore, I am not going to say how rough is rough, and I am not going to comment on whether having the same number of digits (the same order of magnitude) is the best we can get, because it depends on the layer size, and it cannot be guaranteed by 1/keep_prob anyway.

Cheers,
Raymond

PS: Regarding those explanations that claim it to be exactly the same, although we don’t know how they justify that claim, it is interesting to discuss how we could possibly achieve it. To achieve it, we would need to compute the sum without dropout, 2 + 4 + 4 + 2 = 12, then compute the sum with dropout (say the two retained values were 2 and 2, giving 2 + 2 = 4), then compute the scaling factor, which is 12 / 4 = 3, and then apply that factor. By doing these steps, we double the computation cost. When we design a computer algorithm, especially for something like training a neural network that is already very costly, we just won’t go that way without proving it is worth doubling the cost.
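
Just to illustrate, a sketch of what that “exact” version would have to do with the example numbers (note the extra reduction over the layer just to build the factor):

import numpy as np

a = np.array([2., 4., 4., 2.])
keep_prob = 0.5
d = np.array([1., 0., 0., 1.])                 # a mask that keeps the two 2s

# Usual inverted dropout: one pass, scale by the constant 1/keep_prob
a_usual = (a * d) / keep_prob                  # sums to 8

# "Exact" version: also needs the sum WITHOUT dropout, i.e. an extra
# reduction over the layer, just to compute a per-iteration factor
exact_factor = np.sum(a) / np.sum(a * d)       # 12 / 4 = 3 for this mask
a_exact = (a * d) * exact_factor               # sums to exactly 12

print(np.sum(a_usual), np.sum(a_exact), np.sum(a))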


Raymond has covered everything here, but where does it say “exactly the same”? Everything about dropout is statistical from the very beginning, right? You’re using a random process to generate the masks. If keep_prob is 0.65, the actual fraction of neurons that get kept on a given iteration will not be exactly 65% every time. Note that whether you zap a given neuron is “quantized”: you can’t partially zap a neuron, it’s either zapped or not, and the size of the A matrix may not even be divisible in a way that makes keeping exactly 65% of the neurons possible.

So everything about the behavior is statistical. If you read the threads I linked, notice that I mention the concept of “expected value”, a standard notion in statistics, multiple times, and Prof Ng also mentions it in the lectures.

Here’s a little experiment I ran to get a sense for how the statistics play out:

import numpy as np

np.random.seed(42)
keep_prob = 0.85
means = []
for ii in range(20):
    # 10 x 10 dropout mask: 1 means keep the neuron, 0 means zap it
    D = np.random.rand(10,10)
    D = (D < keep_prob).astype(float)
    Dmean = np.mean(D)   # actual fraction of kept neurons on this iteration
    means.append(Dmean)
    print(f"{ii}: mean(D) = {Dmean}, mean(means) = {np.mean(np.array(means))}")

When I run that, here’s the output:

0: mean(D) = 0.87, mean(means) = 0.87
1: mean(D) = 0.82, mean(means) = 0.845
2: mean(D) = 0.83, mean(means) = 0.84
3: mean(D) = 0.87, mean(means) = 0.8475
4: mean(D) = 0.78, mean(means) = 0.834
5: mean(D) = 0.81, mean(means) = 0.8300000000000001
6: mean(D) = 0.9, mean(means) = 0.8400000000000001
7: mean(D) = 0.85, mean(means) = 0.8412499999999999
8: mean(D) = 0.86, mean(means) = 0.8433333333333333
9: mean(D) = 0.88, mean(means) = 0.8470000000000001
10: mean(D) = 0.78, mean(means) = 0.8409090909090909
11: mean(D) = 0.79, mean(means) = 0.8366666666666666
12: mean(D) = 0.89, mean(means) = 0.8407692307692307
13: mean(D) = 0.81, mean(means) = 0.8385714285714286
14: mean(D) = 0.83, mean(means) = 0.838
15: mean(D) = 0.88, mean(means) = 0.840625
16: mean(D) = 0.83, mean(means) = 0.84
17: mean(D) = 0.89, mean(means) = 0.8427777777777777
18: mean(D) = 0.87, mean(means) = 0.8442105263157894
19: mean(D) = 0.91, mean(means) = 0.8474999999999999

So you can see what I mean about the stochastic behavior. Notice that the actual fraction of kept neurons on each iteration is all over the map: it ranges from 78% to 91% and hits exactly 85% only once in 20 tries, but the mean of those per-iteration values approaches 0.85 as the number of samples increases. It’s the Law of Large Numbers in action! :nerd_face:

So you could actually do a more precise compensation: instead of dividing by keep_prob, you could divide by the actual fraction of non-zapped neurons on a given iteration. It’s an interesting question to consider why they didn’t do it that way, but my guess is that the extra compute cost of doing that doesn’t really buy you anything since all the behavior is statistical anyway and it all comes out in the wash with lots of iterations.
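
In code, that more precise compensation would just be something like this sketch (the activations here are made-up random numbers, only to show the two divisors side by side):

import numpy as np

np.random.seed(42)
keep_prob = 0.85

A = np.random.rand(10, 10)                     # made-up activations
D = (np.random.rand(10, 10) < keep_prob).astype(float)

A_standard = (A * D) / keep_prob               # divide by the nominal keep_prob
A_precise  = (A * D) / np.mean(D)              # divide by the actual kept fraction

# The "precise" version corrects for how many neurons were actually kept on
# this iteration, but the sum can still differ from the no-dropout sum,
# because it also matters WHICH neurons got zapped
print(np.sum(A), np.sum(A_standard), np.sum(A_precise))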
