Why do we change the dropped-out neurons for every training data point in each iteration? Shouldn't we keep the same nodes across all training samples for a given iteration? Essentially, shouldn't D[l] be an n[l] x 1 column vector (one entry per unit, broadcast across the batch) instead of being the same shape as A[l]?
I am trying to develop an intuitive understanding of dropout regularization. I understand it's fine to drop out separately for each training example in each iteration as well, but I'm trying to understand why that is better than just using one set of dropped units for the whole iteration (across all training examples).
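To make the two options concrete, here's a minimal numpy sketch (the names and shapes are illustrative, not taken from the assignment) of a per-example mask with the same shape as A[l] versus a single per-unit mask of shape (n[l], 1) broadcast across the batch:

```python
import numpy as np

np.random.seed(0)
A = np.random.rand(3, 4)   # activations A[l]: 3 hidden units, batch of 4 examples
keep_prob = 0.8

# What the course does: one mask entry per (unit, example),
# so every column (training example) gets its own dropout pattern.
D_per_example = np.random.rand(*A.shape) < keep_prob         # shape (3, 4)

# What the question suggests: one mask entry per unit,
# broadcast across all examples in this iteration.
D_per_iteration = np.random.rand(A.shape[0], 1) < keep_prob  # shape (3, 1)

A_dropped_1 = (A * D_per_example) / keep_prob    # different units dropped per example
A_dropped_2 = (A * D_per_iteration) / keep_prob  # same units dropped for the whole batch
```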
This is an interesting question that has come up before. In the limit, if you were doing Stochastic GD, there would be one sample in each minibatch, so each sample would be handled differently w.r.t. dropout. The way Prof Ng has us do it preserves that behavior even when the batch size is greater than 1. Here's an earlier thread which discusses this point and even shows some experimental results with both methods.
Good discussion, thank you. I like the intuition of having a stronger effect. It looks like the per-sample approach (resetting the dropout mask for each sample) consistently delivers better results, although sometimes only by a small margin. I am trying to develop a feel for why that would be the case, or whether it is just random chance with this particular neural net / this particular set of hyperparams.
I found the original paper introducing dropout regularization, and there are some useful insights there:
Applying dropout to a neural network amounts to sampling a "thinned" network from it. The thinned network consists of all the units that survived dropout (Figure 1b). A neural net with n units can be seen as a collection of 2^n possible thinned neural networks. These networks all share weights so that the total number of parameters is still O(n^2), or less. For each presentation of each training case, a new thinned network is sampled and trained. So training a neural network with dropout can be seen as training a collection of 2^n thinned networks with extensive weight sharing, where each thinned network gets trained very rarely, if at all.
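As a toy illustration of the 2^n counting argument (my own example, not from the paper), for a layer with only 4 units you can enumerate every possible thinned sub-network:

```python
import numpy as np
from itertools import product

n = 4                                        # tiny layer with 4 units
all_masks = list(product([0, 1], repeat=n))  # every possible keep/drop pattern
print(len(all_masks))                        # 16 == 2**4 thinned sub-networks

# Each dropout forward pass effectively samples one of these patterns at random;
# the weights are shared across all of them.
keep_prob = 0.5
sampled = (np.random.rand(n) < keep_prob).astype(int)
print(sampled)                               # e.g. [1 0 1 1]
```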
So the way I understand it, this is similar to what you suggested. If we keep the same dropout mask for a whole iteration, that particular combination of dropped units becomes too strong and we don't get the full stochastic effect. In practice, it could turn out that the shared-mask (per-iteration) version works better for some particular combination of network and hyperparameters, but the general theory makes more sense if we increase the randomness as much as possible, so it makes sense to drop out separately for each sample. At least that's my takeaway.
Yes, that sounds right to me as well. One other point that was made on the other thread I linked: the way the assignment notebook implements things, the random seed is set the same way on every iteration. That means the dropout mask is exactly the same on every iteration, which is definitely not what was intended. They do it that way in essentially all the assignments here for ease of comparing "expected values" and for the grader. Of course, there would have been a better way to achieve that goal: set the random seed in the test driver code, not in the actual code we are implementing. So if you want to run experiments and see the real behavior of dropout, I think you should disable the code that sets the random seed before generating the dropout masks.
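For example (a sketch with made-up function names, not the notebook's actual code), the idea is to keep the seed out of the forward-propagation routine and set it once in the driver instead:

```python
import numpy as np

def forward_with_dropout(A_prev, W, b, keep_prob):
    # Note: no np.random.seed() call in here, so the mask changes on every call/iteration.
    Z = np.dot(W, A_prev) + b
    A = np.maximum(0, Z)                        # ReLU, just for the sketch
    D = np.random.rand(*A.shape) < keep_prob    # fresh dropout mask
    return (A * D) / keep_prob, D

# The seed is set once in the driver, so the experiment is reproducible
# without freezing the dropout mask itself.
np.random.seed(1)
A_prev = np.random.rand(5, 10)
W, b = np.random.rand(3, 5), np.zeros((3, 1))

_, D1 = forward_with_dropout(A_prev, W, b, keep_prob=0.8)
_, D2 = forward_with_dropout(A_prev, W, b, keep_prob=0.8)
print(np.array_equal(D1, D2))   # almost certainly False: the mask now varies between calls
```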