Inverted dropout, killing nodes or stabbing training examples?

This is a really interesting point that you've noticed: the way we implement dropout, it does not handle all the samples the same way in each iteration. I honestly forget whether Prof Ng makes this point in the lectures, but it's clear from the way the instructions are written in the notebook that this is how we are supposed to implement it. My intuition is that doing it this way weakens the effect of the dropout. Using your method of making the "mask" a column vector, so that all the samples are treated the same, would make the effect more intense. Either way will probably work in the end, but you might need different keep_prob values to get the same result with the two methods. The sketch below shows the two masking strategies side by side.
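Here is a minimal NumPy sketch of the two masking strategies for comparison; the function name and the activation `A1` are placeholders for illustration, not the notebook's actual template code:

```python
import numpy as np

def dropout_masks(A, keep_prob):
    """Illustrative only: build the two kinds of inverted-dropout masks
    for an activation matrix A of shape (n_units, m_samples)."""
    n, m = A.shape

    # Per-sample mask (what the notebook asks for): shape (n, m), so each
    # column (training example) gets its own pattern of dropped units.
    per_sample_mask = (np.random.rand(n, m) < keep_prob).astype(float)

    # Per-node mask (the column-vector alternative): shape (n, 1), broadcast
    # across the minibatch so every example drops the same units.
    per_node_mask = (np.random.rand(n, 1) < keep_prob).astype(float)

    # Inverted dropout: scale by 1/keep_prob so expected activations match.
    A_per_sample = A * per_sample_mask / keep_prob
    A_per_node = A * per_node_mask / keep_prob
    return A_per_sample, A_per_node

# Example usage with a fake activation matrix
A1 = np.random.randn(4, 5)   # 4 hidden units, 5 training examples
out_per_sample, out_per_node = dropout_masks(A1, keep_prob=0.8)
```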

The other way to think about this is that we are typically doing minibatch gradient descent. Using the "per sample" dropout is effectively doing the dropout as if we were doing Stochastic Gradient Descent (minibatch with batch size = 1). My interpretation is that this makes the minibatch size hyperparameter and the keep_prob hyperparameter independent, meaning that you can tune one without also having to change the other. (This "orthogonality" of hyperparameters is highly desirable, as Prof Ng discusses in the section on how to approach hyperparameter tuning systematically.) If you implement it the way you've described, where each sample in the minibatch is treated the same, then intuitively it seems like that introduces some "coupling" between the keep probability and the minibatch size. I don't know whether that intuition is correct or whether that was the motivation for doing it the way they did, but it's something to consider.

One other subtle point here is that the template code in the notebook sets the random seed inside the actual forward propagation code, for simplicity of grading and checking results. But doing it that way means that we literally get exactly the same dropout mask on every iteration, which is definitely not how dropout was intended to work: the behavior is supposed to be statistical. The better approach would be to set the seed in the test logic, not in the runtime code. We've reported this as a bug to the course staff. A sketch of that pattern is below.
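As a hedged sketch of what "seed in the test, not in the runtime code" means (the function and test names here are hypothetical, not the notebook's actual template), the idea is simply to move the seeding out of the forward pass and into the test harness:

```python
import numpy as np

def forward_with_dropout(A_prev, W, b, keep_prob):
    # Runtime code: no seeding here, so every call draws a fresh mask
    # and dropout stays statistical across iterations.
    Z = W @ A_prev + b
    A = np.maximum(0, Z)                        # ReLU activation
    mask = (np.random.rand(*A.shape) < keep_prob).astype(float)
    return A * mask / keep_prob                 # inverted dropout scaling

def test_forward_with_dropout():
    # Test code: seed here so the expected output is reproducible for grading.
    np.random.seed(1)
    A_prev = np.random.randn(3, 4)
    W = np.random.randn(2, 3)
    b = np.random.randn(2, 1)
    np.random.seed(1)                           # pin the dropout mask for the check
    out = forward_with_dropout(A_prev, W, b, keep_prob=0.8)
    assert out.shape == (2, 4)

test_forward_with_dropout()
```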
