Inverted dropout, killing nodes or stabbing training examples?

While watching the “Dropout Regularization” and “Understanding Dropout” lectures, something I couldn’t get my head around is that we randomly shut off nodes separately for each example!
So the main question is: am I right? Here’s what I mean:
As far as I understand, the reason behind using this kind of regularization is to spread out the weights, or in other terms, make the model work out its weaker weights more, like muscles in our body. Kind of!
Hence, at any layer, say layer 3 (as in the lecture), each row corresponds to a node and each column corresponds to a different training example:
A^[3] has shape (n, m)
e.g.
Suppose A is:

 [[55 24  2]
 [49 31 83]
 [40 45 33]
 [27 76 38]
 [90 88 75]
 [12 35 30]
 [56 21 46]]

This means we have 3 examples, e.g. three cat pictures. As in the lecture, d3 will be:

d3 = np.random.rand(7, 3) < 0.8

Suppose our d3 turns out to be the following matrix:

 [[False  True  True]
 [False  True False]
 [ True False  True]
 [False  True  True]
 [ True  True  True]
 [ True  True  True]
 [ True  True  True]]

… and by doing a3 * d3 we get something like:

[[ 0 24  2]
 [ 0 31  0]
 [40  0 33]
 [ 0 76 38]
 [90 88 75]
 [12 35 30]
 [56 21 46]]

And out of curiosity (though it has nothing to do with my question), a3 / 0.8 will be:

[[  0.    30.     2.5 ]
 [  0.    38.75   0.  ]
 [ 50.     0.    41.25]
 [  0.    95.    47.5 ]
 [112.5  110.    93.75]
 [ 15.    43.75  37.5 ]
 [ 70.    26.25  57.5 ]]
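
Putting the lecture steps together, here is a minimal self-contained sketch of the standard (per-element) inverted dropout for this layer; the variable names are just mine for illustration:

import numpy as np

keep_prob = 0.8
A3 = np.random.rand(7, 3)                    # layer-3 activations: 7 nodes x 3 examples

D3 = np.random.rand(*A3.shape) < keep_prob   # boolean mask, True means "keep this node for this example"
A3 = A3 * D3                                 # a different set of nodes is shut off in each column (example)
A3 = A3 / keep_prob                          # inverted-dropout scaling, so the expected activation is unchanged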

So if we consider just the first column (one example), we shut off nodes 1, 2, and 4, but for the other training examples we don’t shut off the same nodes. Why shouldn’t we just make d3 a one-column vector, so that each node is treated the same across all of its examples?

d3 = np.random.rand(7, 1) < 0.8

and a3 * d3:

[[ 0  0  0]
 [ 0  0  0]
 [40 45 33]
 [ 0  0  0]
 [90 88 75]
 [12 35 30]
 [56 21 46]]

I mean, given the terminology “dropping out nodes”, wouldn’t it be this?
Now, looking back at my question, it seems a bit clearer what’s going on, but why don’t we do the same thing to each node across all of its training examples at a given iteration?

Thanks in advance.


This is a really interesting point that you have noticed: the way we implement dropout, it does not handle all the samples the same way in each iteration. I honestly forget whether Prof Ng makes this point in the lectures or not, but it’s clear from the way the instructions are written in the notebook that this is the way we are supposed to implement it. My intuition is that doing it this way weakens the effect of the dropout; using your method of making the “mask” a column vector and treating all the samples the same would make the effect more intense. Either way will probably work in the end, but perhaps you would need different keep_prob values to get the same result with the two different methods.

The other way to think about this is that we are typically doing minibatch gradient descent. Using the “per sample” dropout is effectively doing the dropout as if we were doing Stochastic Gradient Descent (minibatch with batchsize = 1). My interpretation is that doing this makes the hyperparameter of minibatch size and the hyperparameter keep_prob independent, meaning that you can tune one without also having to change the other. (This property of “orthogonality” of hyperparameters is highly desirable, as Prof Ng discusses in the section on how to systematically approach hyperparameter tuning.) If you implement it the way you’ve described where each sample in the minibatch is treated the same, then intuitively it seems like that gives some “coupling” between the keep probability value and the minibatch size. I don’t know if that intuition is correct and whether that was the motivation for doing it the way they did, but something to consider.
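
Just to make the two variants concrete, here is a quick sketch (illustrative names only, not the notebook code):

import numpy as np

keep_prob = 0.8
A = np.random.rand(7, 3)   # (nodes, samples) activations for one minibatch

# "Per sample" mask (the notebook's method): each column of the minibatch
# gets its own random set of dropped nodes.
D_per_sample = np.random.rand(A.shape[0], A.shape[1]) < keep_prob
A_per_sample = (A * D_per_sample) / keep_prob

# "Per batch" column-vector mask (the proposal above): the same nodes are
# dropped for every sample in the minibatch, via broadcasting of the (7, 1) mask.
D_per_batch = np.random.rand(A.shape[0], 1) < keep_prob
A_per_batch = (A * D_per_batch) / keep_prob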

One other subtle point here is that the template code in the notebook sets the random seed in the actual forward propagation code for simplicity of grading and checking results. But doing it that way means that we literally get exactly the same dropout mask on every iteration, which is definitely not how dropout was intended to work: the behavior is supposed to be statistical. The better approach would be to set the seed in the test logic, not in the actual runtime code. We’ve reported this as a bug to the course staff.
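
Roughly what I mean, as a toy sketch (not the actual notebook code):

import numpy as np

def forward_with_dropout(A, keep_prob):
    # No np.random.seed(...) call in here: the mask must differ on every call.
    D = np.random.rand(*A.shape) < keep_prob
    return (A * D) / keep_prob

np.random.seed(1)                       # seed once, in the test / outer training logic
A = np.random.rand(7, 3)
out1 = forward_with_dropout(A, 0.8)     # successive iterations now get different masks,
out2 = forward_with_dropout(A, 0.8)     # which is how dropout is supposed to behave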


Note that this is an experimental science: you could actually run a comparison between the given method (but with the random seed not set in the forward propagation code) and your proposed “all samples treated the same” method. It would be interesting to see if the intuition that your method would be more “intense” and thus require perhaps a higher keep_prob value in order to have the same effect plays out or not. Science! :nerd_face:


Excuse my late response; at the time I wasn’t able to grasp everything you said and needed to review some of the material.
I appreciate both your replies, particularly the second one, and the bug you noticed was interesting to me.
In response to your second answer: I’d be pleased to try this out and share any results I find that are interesting or valuable.
Thanks

So in the second module assignment, which is about dropout, I made the following change:
In forward_propagation_with_dropout, the mask becomes:
D1 = np.random.rand(A1.shape[0], 1)
and I changed D2 for the next layer in the same way.

Coding Details

If A1’s shape is (n, m), for example n = 5, m = 3:

d = np.random.rand(5, 1) < 0.5
d = d.astype(int)

We end up with something like:

array([[1],
       [1],
       [1],
       [0],
       [1]])

Thanks to Python broadcasting, A1 * d gives:

array([[0.28123521, 0.25551862, 0.25786374],
       [0.297053  , 0.0089446 , 0.98717618],
       [0.56646064, 0.60184241, 0.53826007],
       [0.        , 0.        , 0.        ],
       [0.67295579, 0.534168  , 0.05507231]])

So we are shutting off the same neuron across all the samples.

Since D1 and D2 are cached and passed to the backpropagation function, we don’t need to change anything there.
Hope I got that right! This minor change is really simple.
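
Putting the forward and backward pieces together as a self-contained sketch (stand-in arrays, not the full notebook code; I’m assuming the backward pass applies the cached mask as dA1 * D1 / keep_prob, as in the assignment):

import numpy as np

keep_prob = 0.91
A1  = np.random.rand(5, 3)    # stand-in for the layer-1 activations (n = 5, m = 3)
dA1 = np.random.rand(5, 3)    # stand-in for the gradient flowing back into layer 1

# Forward: a one-column mask, so the same nodes are dropped for every sample
D1 = np.random.rand(A1.shape[0], 1) < keep_prob
A1 = (A1 * D1) / keep_prob    # broadcasting applies the (5, 1) mask to all m columns

# Backward: the cached D1 broadcasts exactly the same way, so backprop code
# that applies dA1 * D1 / keep_prob (as I understand the notebook does)
# needs no modification.
dA1 = (dA1 * D1) / keep_prob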

I would have said the two methods do almost the same thing, until I noticed something.

Here is the result of the masked version (np.random.rand(A1.shape[0], 1)) compared to the original version (np.random.rand(*A1.shape)), keeping everything else fixed except keep_prob:

Original Method:

details

keep_prob = 0.86

Cost after iteration 0: 0.6543912405149825
Cost after iteration 10000: 0.0610169865749056
Cost after iteration 20000: 0.060582435798513114


On the train set: Accuracy: 0.9289099526066351
On the test set: Accuracy: 0.95

Masked Method:

details

keep_prob = 0.91

Cost after iteration 0: 0.690437069943951
Cost after iteration 10000: 0.17615076457892637
Cost after iteration 20000: 0.16678299596707952


On the train set: Accuracy: 0.9383886255924171
On the test set: Accuracy: 0.96

So yeah… as we expected, this method is more intense and needs a higher keep_prob, i.e. it should shut off fewer neurons.
Keeping in mind that one factor (randomization) could affect the results, I noticed it was much easier for me to find the keep_prob hyperparameter for the masked version. I did many different runs of the program to find a better keep_prob for both methods and to make sure of this. Assuming whoever wrote the assignment had already found the best value for keep_prob, for me even reproducing the value 0.86 was more difficult. So not only is the result better (at least a little, on this dataset), finding the right value for keep_prob is easier too.

Here are the results for different values of keep_prob with both methods.
Note I did this a few times and found similar results.
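
For context, the sweep was just a loop along these lines (a sketch only; model and predict are assumed to be the helpers from the assignment notebook, train_X/train_Y and test_X/test_Y its datasets, and the exact signatures are approximate):

import numpy as np

# Sweep 30 keep_prob values and record (train accuracy, dev accuracy) for each.
# model(...) and predict(...) are the assignment notebook's helpers (assumed).
results = {}
for kp in np.linspace(0.8, 0.91, 30):      # 0.85 to 0.96 for the masked version
    parameters = model(train_X, train_Y, keep_prob=kp)
    train_acc = np.mean(predict(train_X, train_Y, parameters) == train_Y)
    dev_acc   = np.mean(predict(test_X,  test_Y,  parameters) == test_Y)
    results[kp] = (train_acc, dev_acc)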

accuracy on 30 different values of the keep_prob

Original version
# keep_prob: (accuracy on train set, accuracy on dev set)
 0.8: (0.919431279620853, 0.93),
 0.8037931034482759: (0.9146919431279621, 0.92),
 0.8075862068965518: (0.919431279620853, 0.925),
 0.8113793103448276: (0.933649289099526, 0.945),
 0.8151724137931035: (0.9383886255924171, 0.945),
 0.8189655172413793: (0.9383886255924171, 0.94),
 0.8227586206896552: (0.9383886255924171, 0.95),
 0.8265517241379311: (0.933649289099526, 0.925),
 0.830344827586207: (0.9289099526066351, 0.925),
 0.8341379310344827: (0.9289099526066351, 0.93),
 0.8379310344827586: (0.9289099526066351, 0.925),
 0.8417241379310345: (0.9241706161137441, 0.92),
 0.8455172413793104: (0.933649289099526, 0.925),
 0.8493103448275863: (0.9383886255924171, 0.93),
 0.8531034482758622: (0.9289099526066351, 0.93),
 0.8568965517241379: (0.9289099526066351, 0.95),
 0.8606896551724138: (0.9289099526066351, 0.95),
 0.8644827586206897: (0.919431279620853, 0.94),
 0.8682758620689656: (0.933649289099526, 0.94),
 0.8720689655172414: (0.9289099526066351, 0.935),
 0.8758620689655172: (0.9241706161137441, 0.935),
 0.8796551724137931: (0.919431279620853, 0.945),
 0.883448275862069: (0.9146919431279621, 0.945),
 0.8872413793103449: (0.919431279620853, 0.94),
 0.8910344827586207: (0.933649289099526, 0.95),
 0.8948275862068966: (0.909952606635071, 0.935),
 0.8986206896551725: (0.9052132701421801, 0.92),
 0.9024137931034483: (0.9289099526066351, 0.955),
 0.9062068965517242: (0.933649289099526, 0.95),
 0.91: (0.9289099526066351, 0.925)

# on average: 
# (0.92685624, 0.936)
Masked version
# keep_prob: (accuracy on train set, accuracy on dev set)
0.85: (0.9383886255924171, 0.955),
 0.8537931034482759: (0.9383886255924171, 0.96),
 0.8575862068965517: (0.957345971563981, 0.93),
 0.8613793103448275: (0.9620853080568721, 0.935),
 0.8651724137931034: (0.957345971563981, 0.935),
 0.8689655172413793: (0.9620853080568721, 0.935),
 0.8727586206896552: (0.9620853080568721, 0.935),
 0.876551724137931: (0.9289099526066351, 0.945),
 0.8803448275862069: (0.9289099526066351, 0.945),
 0.8841379310344827: (0.9289099526066351, 0.945),
 0.8879310344827586: (0.9289099526066351, 0.945),
 0.8917241379310344: (0.9289099526066351, 0.945),
 0.8955172413793103: (0.9289099526066351, 0.945),
 0.8993103448275862: (0.9289099526066351, 0.945),
 0.9031034482758621: (0.9289099526066351, 0.95),
 0.9068965517241379: (0.9289099526066351, 0.95),
 0.9106896551724137: (0.9383886255924171, 0.96),
 0.9144827586206896: (0.9289099526066351, 0.96),
 0.9182758620689655: (0.9383886255924171, 0.955),
 0.9220689655172414: (0.9383886255924171, 0.96),
 0.9258620689655173: (0.943127962085308, 0.955),
 0.929655172413793: (0.933649289099526, 0.955),
 0.9334482758620689: (0.9383886255924171, 0.955),
 0.9372413793103448: (0.9383886255924171, 0.955),
 0.9410344827586207: (0.9478672985781991, 0.95),
 0.9448275862068966: (0.933649289099526, 0.955),
 0.9486206896551723: (0.933649289099526, 0.955),
 0.9524137931034482: (0.9289099526066351, 0.955),
 0.9562068965517241: (0.9383886255924171, 0.96),
 0.96: (0.9289099526066351, 0.95)

# on average: 
# (0.9382306477093206, 0.949333333333333)

Very cool! Thanks very much for doing the careful research here and documenting the results.

I just wanted to confirm one technical detail: you don’t show the full code (which is probably better anyway) and you don’t specifically mention the random seed issue. In all the experimentation here, did you do that with the setting of the random seed removed from the low level code? It would be fine to set it once per entire training session, but you want to make sure it’s not getting set on every iteration. Otherwise (as I commented earlier), it’s really defeating the true purpose of dropout.


I hadn’t removed it, including from the forward propagation part.
This time I set the random seed once per entire training session and redid everything I had done before, although I didn’t expect any different results.
Actually, I wasn’t sure exactly where I stood, so I tried all the combinations.
The result: not much different from the previous results; both methods work pretty well.

Second Try Result

Original version:

(0.8, (0.943127962085308, 0.93))
(0.8037931034482759, (0.95260663507109, 0.94))
(0.8075862068965518, (0.943127962085308, 0.935))
(0.8113793103448276, (0.957345971563981, 0.94))
(0.8151724137931035, (0.957345971563981, 0.94))
(0.8189655172413793, (0.957345971563981, 0.935))
(0.8227586206896552, (0.95260663507109, 0.94))
(0.8265517241379311, (0.95260663507109, 0.935))
(0.830344827586207, (0.95260663507109, 0.935))
(0.8341379310344827, (0.943127962085308, 0.935))
(0.8379310344827586, (0.95260663507109, 0.94))
(0.8417241379310345, (0.9478672985781991, 0.935))
(0.8455172413793104, (0.9478672985781991, 0.935))
(0.8493103448275863, (0.9478672985781991, 0.935))
(0.8531034482758622, (0.9478672985781991, 0.94))
(0.8568965517241379, (0.943127962085308, 0.94))
(0.8606896551724138, (0.943127962085308, 0.94))
(0.8644827586206897, (0.95260663507109, 0.935))
(0.8682758620689656, (0.9478672985781991, 0.94))
(0.8720689655172414, (0.9478672985781991, 0.94))
(0.8758620689655172, (0.95260663507109, 0.935))
(0.8796551724137931, (0.9478672985781991, 0.935))
(0.883448275862069, (0.9478672985781991, 0.935))
(0.8872413793103449, (0.943127962085308, 0.925))
(0.8910344827586207, (0.95260663507109, 0.935))
(0.8948275862068966, (0.943127962085308, 0.935))
(0.8986206896551725, (0.957345971563981, 0.935))
(0.9024137931034483, (0.957345971563981, 0.93))
(0.9062068965517242, (0.957345971563981, 0.935))
(0.91, (0.9620853080568721, 0.93))

avg:

(0.9503949447077409, 0.9358333333333332)

Masked Version:

(0.8, (0.95260663507109, 0.935))
(0.8037931034482759, (0.95260663507109, 0.94))
(0.8075862068965518, (0.95260663507109, 0.945))
(0.8113793103448276, (0.95260663507109, 0.935))
(0.8151724137931035, (0.957345971563981, 0.935))
(0.8189655172413793, (0.95260663507109, 0.945))
(0.8227586206896552, (0.95260663507109, 0.935))
(0.8265517241379311, (0.943127962085308, 0.94))
(0.830344827586207, (0.95260663507109, 0.945))
(0.8341379310344827, (0.9478672985781991, 0.94))
(0.8379310344827586, (0.943127962085308, 0.94))
(0.8417241379310345, (0.95260663507109, 0.945))
(0.8455172413793104, (0.9478672985781991, 0.935))
(0.8493103448275863, (0.95260663507109, 0.945))
(0.8531034482758622, (0.9478672985781991, 0.94))
(0.8568965517241379, (0.95260663507109, 0.95))
(0.8606896551724138, (0.95260663507109, 0.945))
(0.8644827586206897, (0.95260663507109, 0.935))
(0.8682758620689656, (0.95260663507109, 0.945))
(0.8720689655172414, (0.95260663507109, 0.935))
(0.8758620689655172, (0.95260663507109, 0.94))
(0.8796551724137931, (0.957345971563981, 0.945))
(0.883448275862069, (0.957345971563981, 0.945))
(0.8872413793103449, (0.95260663507109, 0.94))
(0.8910344827586207, (0.95260663507109, 0.945))
(0.8948275862068966, (0.9478672985781991, 0.94))
(0.8986206896551725, (0.9478672985781991, 0.935))
(0.9024137931034483, (0.95260663507109, 0.95))
(0.9062068965517242, (0.95260663507109, 0.945))
(0.91, (0.95260663507109, 0.95))

avg:

(0.9516587677725113, 0.9415)

Thank you very much again for doing all this detailed investigation and sharing the results! Very interesting. I guess we have to be a little careful not to read too much into the differences, since the difference between 95% dev accuracy and 96% dev accuracy should probably be considered “in the noise”. But as you say, it doesn’t seem to change the results much to “hoist” the random seed setting to the outer level.

Looking at the “original” per sample method with low level seed setting, the two fairly stable areas where the results are best are:

And for higher keep_prob values:

If we compare that to the “original” per sample with the high level seeds, the good areas have slightly lower dev accuracy values:

So basically the same range of keep_prob values, other than the two > 95% values with keep_prob in the 0.90 range on the first case.

But as I commented above, differences that small should probably be considered just noise. I think your conclusion that it works well either way is the correct one. Since we have two independent attributes (“per sample” or “per batch” and “high level seed” or “low level seed”), it seems right to conclude that any of the 4 combinations are very close in performance.

Interesting! Thanks very much for sharing your results!


Sorry, could you explain why this method might need a higher keep_prob value please? Thank you in advance.

That was only an intuition, based on the idea that using the same mask for all the samples in a given batch would have a more “intense” effect. Note that the experiments Passhbi shows with the random seed set in the low level forward propagation routine do seem to show that. But when he ran the experiment without the seed set in the low level code, the difference between the two methods is not very strong. My claim is that not setting the seed in the low level code is the way dropout was really intended to work, so the final conclusion seems to be that the method of using the column vector for the mask is very close to the method of using a different mask per sample. That’s the way science works: sometimes you have a theory, but when you actually do the experiment it doesn’t work out the way you expected. :nerd_face:
