Hello @ricky_pii,

Hah hah! As you have probably realized, we can’t. We can’t rely on just the 1/keep_prob scaling to get back exactly the same value, and it gives us no guarantee about how rough is rough.

What we also need is a large layer size. You see, dropout is a regularization technique, and we apply regularization when a model is vulnerable to overfitting, which usually happens when the neural network is too large. The more usual scenario for dropout is when our layers have a considerable number of units. Your example [2, 4, 4, 2] has at most 4 units per layer, which is not large at all. It is a good example to illustrate your question, but it is not a good example of a neural network that is vulnerable to overfitting.

Consider a layer that has 2000 units: a reasonable keep_prob like 0.5 will leave us about 1000 activation values to sum. With such a large number of values, there is a very good chance that applying 1/keep_prob will get us pretty close to the no-dropout result. You might experiment with this using a numpy random generator. Actually, if you really experiment, try not just 2000, but also 1000, 200, 100, 20, 10, and 4, and you can see how the variation changes with the layer size.
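Here is a minimal sketch of that experiment (the uniform activations and the number of trials are my own assumptions, just to make the trend visible):

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.5

# For each layer size, estimate how far the inverted-dropout sum deviates
# from the no-dropout sum, averaged over many random dropout masks.
for n in [4, 10, 20, 100, 200, 1000, 2000]:
    a = rng.uniform(0.0, 1.0, size=n)       # hypothetical activation values
    true_sum = a.sum()                       # sum without dropout
    rel_errs = []
    for _ in range(1000):
        mask = rng.random(n) < keep_prob     # keep each unit with prob keep_prob
        scaled_sum = (a * mask).sum() / keep_prob  # inverted dropout scaling
        rel_errs.append(abs(scaled_sum - true_sum) / true_sum)
    print(f"n = {n:5d}   mean relative error: {np.mean(rel_errs):.3f}")
```

You should see the mean relative error shrink steadily as n grows: quite rough at n = 4, but only a few percent by n = 2000.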

Therefore, I am not going to say how rough is rough, and I am not going to comment on whether having the same number of digits (the same order of magnitude) is the best we can get, because it depends on the layer size and cannot be guaranteed by `1/keep_prob` anyway.

Cheers,

Raymond

PS: Regarding those explanations that claimed the result to be the same: although we don’t know how they justify that claim, it is interesting to discuss how we could actually achieve it. To achieve it, we would need to compute the sum without dropout, 2 + 4 + 4 + 2 = 12, then compute the sum with dropout, 2 + 2 = 4, then compute the scaling factor, which is 12 / 4 = 3, and then apply that factor. By doing these steps, we double the computation cost. When we design a computer algorithm, especially for something like neural network training that is already very costly, we just won’t take that route without proving it is worth doubling the cost.
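To make the comparison concrete, here is a sketch of both versions side by side, using your [2, 4, 4, 2] numbers and an assumed dropout mask that keeps the first and last units (this is not how any framework actually implements dropout, just an illustration of the extra work):

```python
import numpy as np

a = np.array([2.0, 4.0, 4.0, 2.0])          # activations from the [2, 4, 4, 2] example
mask = np.array([True, False, False, True])  # assumed mask: keep the 1st and 4th units
keep_prob = 0.5

# Standard inverted dropout: one pass over the kept values, scale by 1/keep_prob.
inverted = (a * mask) / keep_prob            # sums to (2 + 2) / 0.5 = 8, only roughly 12

# "Exact" rescaling: also needs the sum WITHOUT dropout, doubling the work.
full_sum = a.sum()                           # 2 + 4 + 4 + 2 = 12  (the extra computation)
kept_sum = (a * mask).sum()                  # 2 + 2 = 4
exact_factor = full_sum / kept_sum           # 12 / 4 = 3
exact = a * mask * exact_factor              # sums back to exactly 12
```

Note that the "exact" branch has to evaluate the full no-dropout sum anyway, which defeats the purpose of dropping units cheaply.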