Week 1 - Doubt in Dropout Regularization lecture video

I have a doubt about the use of the scaling factor. In the attached example, the values 14.1 and 13.1, which are very large compared to the rest, get dropped out. How does rescaling the other activations, which are much smaller, compensate for dropping such large activation values?
I intuitively feel that if dropout is used from the beginning of training, then such a disparity between activation values shouldn't be possible at any point. Am I right, and if not, what is the right explanation?

Hi, @Prasannab.

The scaling factor restores the expected values of the activations. Think about it in a probabilistic context. A contrived example with a small sample size probably won’t give you a good intuition.

Let me know if this explanation helps :slight_smile:


@nramon
Apologies for such a long question; I am not able to articulate my doubt succinctly.

I had seen that previous explanation before I posted my question, but my doubt still persists. My doubt is that asymmetry in the weights is what helps a neural network predict correctly, by giving more importance to some features than to others in every layer. I don't understand how uniformly scaling all activations by keep_prob takes care of this asymmetry, because some of the activations that randomly get chosen to be dropped may be very large.
For example, if keep_prob = 0.5 and the number of units in layer l is 100, and the 50 values that are dropped happen (by chance) to be 50 very large values in a[l] (in one of the iterations) while the other 50 activations are very small, how can doubling the values of the activations that are kept compensate for the dropped significant ones?
Is scaling activations by keep_prob a rough attempt at compensating for the dropped activations, or is it a precise method that holds true for all cases?

No need to apologize!

The problem of symmetry is something different. Dividing by keep_prob is compensating for this:
[image from the cited source: with dropout, the expected value of each activation is keep_prob times its original value, so dividing by keep_prob restores the expectation] (source)

From what I understood, this is the formal justification, but an example may be clearer:

>>> import numpy as np
>>> m = 1000
>>> keep_prob = 0.5
>>> r = np.random.binomial(1, keep_prob, m)  # dropout mask: 1 = keep, 0 = drop
>>> y = np.random.normal(1, 0.1, m)          # "activations" with mean 1
>>> np.mean(y)                               # original mean
1.0004794606768865
>>> np.mean(r * y)                           # after dropping: roughly keep_prob * mean
0.5123941559332359
>>> np.mean(r * y / keep_prob)               # after rescaling: mean is restored
1.0247883118664718

Let me know if something doesn’t make sense :slight_smile:


The other general thing to note, which is implicit in @nramon’s excellent explanations, is that all this behavior is statistical. The whole point is that different neurons get dropped on every sample on every iteration. On any given iteration, the compensation of dividing by keep_prob may not precisely counteract the actual dropped neurons’ values, but it could just as likely be an undershoot as an overshoot on any given sampling. It all comes out in the wash, statistically speaking, since what we are after is the aggregate effect after thousands or tens of thousands of training iterations.
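
To make that "comes out in the wash" point concrete, here is a quick sketch (the toy setup and variable names are just an illustration, not anything from the lecture): it applies a fresh inverted-dropout mask to the same activation vector many times and compares each iteration's compensated mean with the true mean. Individual iterations overshoot or undershoot, but the average across iterations lands right on target.

import numpy as np

np.random.seed(0)
keep_prob = 0.5
a = np.random.normal(1, 0.1, 100)     # a fixed vector of 100 activations

# Apply a fresh inverted-dropout mask many times and track how well the
# compensated mean matches the true mean on each "iteration".
n_iters = 10000
masks = np.random.binomial(1, keep_prob, size=(n_iters, a.size))
compensated_means = (masks * a / keep_prob).mean(axis=1)

print("true mean:               ", a.mean())
print("per-iteration std:       ", compensated_means.std())   # noticeable scatter on any single iteration
print("mean over all iterations:", compensated_means.mean())  # very close to the true mean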


Another more subtle point that I remember someone making about this a couple of years back is that the actual number of neurons that get dropped on any given iteration is also statistical: it may not be exactly equal to (1 - keep_prob) times the total number of elements in the activation output. There may be "quantization" errors if the number of neurons is relatively small, although that's probably not a major source of inaccuracy in "real world" scale networks. Even without the quantization errors, the output of the random function that we are using to do the masking is also statistical. The interesting point to consider is that you could actually compute the exact number of neurons that are dropped in a given iteration and use that to compute a more precise compensation, instead of dividing by keep_prob. But there again the behavior is all statistical, so is it worth the extra compute to calculate the more precise divisor? It might be an interesting thing to play with, to see if you can detect a difference in behavior with the more precise method.
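
If anyone wants to try that experiment, here is one way it might look (just a sketch of the idea above, not the method from the lecture): divide by the fraction of units the mask actually kept in a given iteration, rather than by the nominal keep_prob.

import numpy as np

np.random.seed(1)
keep_prob = 0.8
a = np.random.normal(1, 0.1, 50)          # activations of a smallish layer

mask = (np.random.rand(a.size) < keep_prob).astype(float)

# Standard inverted dropout: divide by the nominal keep probability.
a_nominal = a * mask / keep_prob

# "Exact count" variant: divide by the fraction of units that actually survived.
# This removes the error from the random number of dropped units, but not the
# error from *which* units happened to be dropped.
actual_keep = mask.mean()                 # realized fraction kept; generally not exactly keep_prob
a_exact = a * mask / actual_keep if actual_keep > 0 else a * 0

print("true mean:               ", a.mean())
print("nominal compensation:    ", a_nominal.mean())
print("exact-count compensation:", a_exact.mean())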


This makes sense to me now. Even in my example, a combination in which all the significant activations get dropped is quite improbable, so even if it happens once or twice it won't cause enough change to affect the overall trend of the weight distribution. Thank you so much!
Interesting to think about the quantization errors; I will try it out!

Even in your example it is not compensating exactly for the dropped values, but it works reasonably well to balance them out. That clears it up, thank you!