Inverted Dropout

Hi Mentor,

We had a couple of doubts. Can you please help clarify them?

  1. What does the statement { output at test time = expected output at training time } mean? We cannot understand the intuition behind it.

  2. I don’t understand why we should compensate Z4 by dividing by keep_prob. If we divide by keep_prob, doesn’t the effect of dropout disappear for that layer? What I mean is: we apply dropout to the nodes in the layer, so if we then divide by keep_prob, doesn’t dropout lose its power and have no effect? Can you please help me understand how dropout keeps its power even after dividing by keep_prob? Please kindly help answer this, as I am starting to feel demotivated.

  3. What does it mean that the expected activation values don’t change? Does it mean that the training set a3 and test set a3 values are equal to each other when we divide by keep_prob?

Thanks,
Thayanban

Dear Mentor, can you please help with this?

Dear Mentors, can someone please answer the questions?
@bahadir
@nramon
@eruzanski
@javier
@marcalph
@elece

I don’t understand the statement in your question 1. Can you give a reference to where Prof Ng says that? The offset into the video would be most useful.

For 2) and 3), the point that you are missing is that dropout zeros certain specific neurons on each iteration. The actual neurons that are “zapped” are different (randomly) on each sample on each iteration. Then we need to compensate for those particular missing neurons by slightly increasing the magnitude of all the other neurons that we did not “zap” in that particular iteration. Thus the total amount of “activation energy” stays (roughly) the same, but it comes from different neurons. The whole point of dropout is that it weakens the connections between particular output neurons and the input neurons at the next layer. But we don’t want an overall reduction in the amount of “energy” being output, as expressed (for example) by the 2-norm of A for the layer. In particular for point 3) Prof Ng is making an analogy to the concept of “expected value” in statistics. Even though we are zapping some neurons in the layer each iteration, we want the “expected value” of the activations viewed at the aggregate level to stay roughly constant.
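
The “expected value” point can be checked numerically. This is only a minimal sketch: the activation values, the keep_prob of 0.8, the seed, and the variable names are all made up for illustration. It draws a fresh mask on every iteration, applies the 1/keep_prob scaling, and averages the result over many iterations.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed so the run is repeatable
keep_prob = 0.8                       # illustrative value, not from the course
a = np.array([0.5, 1.0, 2.0, 4.0])    # made-up activations for one layer

n_iters = 20000
total = np.zeros_like(a)
for _ in range(n_iters):
    d = rng.random(a.shape) < keep_prob   # a different random mask each iteration
    total += (a * d) / keep_prob          # zap some neurons, scale up the survivors

avg_inverted = total / n_iters
print("original activations:   ", a)
print("average over iterations:", avg_inverted)
```

On any single iteration some entries are zero and the rest are larger than the originals, but the average over iterations should land close to the original activations.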

Sir,

Actually, should we provide only the dropout activation output values as input to the next layer?

But after dropout, won’t dividing by keep_prob further reduce the non-zapped hidden units? How is it going to increase the magnitude of the other neurons, given that we are using a divide operation and not a multiply operation?

Also, after dropout, the non-zapped neurons stay the same, right? (The d3 boolean matrix is multiplied element-wise with a3, so the non-dropped a3 values stay the same, because a3 * 1 = a3.) If the non-zapped activation values stay the same, then why do we need to compensate, sir?

Yes, if we don’t do anything the individual values of the non-zapped neurons stay the same. But that’s not the point, right? There are fewer of them outputting non-zero values, right? That’s what dropout is about. The point is about the aggregate amount of output from all the neurons in the layer (zapped and non-zapped) taken together. One way to assess that would be to take the 2-norm of the output activation matrix with and without dropout without doing the 1/keep_prob computation and watch what happens. Then try that same experiment again with the factor of 1/keep_prob. You don’t have to wonder about this stuff: you can actually try it and watch what happens.
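
One possible version of that experiment is sketched below. The layer size, the ReLU-style nonnegative activations, the seed, and the variable names are assumptions chosen for illustration, not the course code. It reports both the sum of the activations and the value from np.linalg.norm for three cases.

```python
import numpy as np

rng = np.random.default_rng(1)
keep_prob = 0.8
# Made-up ReLU-style activations: nonnegative, shape (units, examples)
A = np.maximum(rng.standard_normal((200, 1000)), 0.0)

D = rng.random(A.shape) < keep_prob   # boolean mask, True means "keep"
A_dropped = A * D                     # dropout without compensation
A_inverted = A_dropped / keep_prob    # dropout with the 1/keep_prob factor

cases = [("no dropout", A),
         ("dropout only", A_dropped),
         ("dropout + 1/keep_prob", A_inverted)]
for name, M in cases:
    print(f"{name:>22}: sum = {M.sum():.1f}, norm = {np.linalg.norm(M):.2f}")
```

With dropout alone the sum falls to roughly keep_prob times the original, and after the division it returns to roughly the original. The norm moves in the same direction but ends up near 1/sqrt(keep_prob) times the original rather than exactly at it, so the sum (or mean) is the cleaner quantity to watch.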

Speaking of “watching what happens”, note that keep_prob is a probability, meaning a number between 0 and 1. Try dividing 42 by 0.8 and watch what happens.

But if we do dropout and then compensate in the other neurons, doesn’t that become equivalent to the unregularized neural network? Then the concept of regularization becomes meaningless, right sir?

My other doubt: for iteration = 1, if we pass 100 examples, will we end up with 100 different dropout networks, or only one NN for all 100 examples in that single iteration?

I’m not getting this, sir… Can you please provide an example if you have one already?

But the point is that the effect of dropout is that it randomly weakens the connections between specific neurons and the next layer by making the output of a given individual neuron a little more stochastic and less predictable, because it might get zapped in a given iteration. That’s what causes the regularization effect. The point about compensating for the aggregate “energy” by multiplying by 1/keep_prob is just to keep the general level of output the same even though it comes from different neurons.

Everything I’m saying here is just me repeating what Prof Ng says in the lecture maybe with slightly different wording. Since what I’m saying doesn’t seem to be helping, you might want to go back and watch the lectures on Dropout again. He does a way better job of explaining all this talking at the whiteboard than I can by just typing words. Of course he’s also a way better teacher than I can ever hope to be in any case.

There are two ways to implement dropout: you could make the effect be the same on each sample in a given batch or you can make it different for every sample in every iteration. Prof Ng has us build it the latter way and I think that’s what the original paper also says, but it’s a bit ambiguous.
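
The difference between the two options comes down to the shape of the mask. The sketch below assumes the course-style layout where rows are units and columns are examples; the array sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
keep_prob = 0.8
A = np.ones((5, 4))   # 5 units, 4 examples (one per column); made-up values

# Option 1: one mask entry per unit, shared by every example in the batch
D_shared = rng.random((A.shape[0], 1)) < keep_prob    # shape (5, 1)
A_shared = A * D_shared / keep_prob                   # broadcasts across the columns

# Option 2: an independent mask entry for every unit and every example
D_per_sample = rng.random(A.shape) < keep_prob        # shape (5, 4)
A_per_sample = A * D_per_sample / keep_prob

print("shared mask:\n", A_shared)
print("per-sample mask:\n", A_per_sample)
```

With the shared mask every column is identical, so all examples in the batch see the same thinned network. With the per-sample mask, which is the latter option described above, each column can have a different set of zeroed units.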

Did you complete the experiment of dividing 42 by 0.8? What did you get? Now try dividing -0.573 by 0.8. What happens to its absolute value?

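The norm that np.linalg.norm computes for a matrix by default is the square root of the sum of the squares of the elements of the matrix, right? (Strictly speaking that is the Frobenius norm; the matrix 2-norm in the operator sense is the largest singular value. I’m using “2-norm” loosely here.) It is the generalization of the Euclidean length of a vector. The interpretation in more than 1 dimension is a bit more complicated than mere “length”, but you can think of it as a measure of the “magnitude” of a matrix. The operator 2-norm does have a geometric interpretation when you consider matrices as linear transformations between two vector spaces: it is the largest factor by which the matrix can stretch a vector.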

So let’s just create a relatively small matrix with normally distributed values:

import numpy as np

# Fixed seed so the same matrix comes out on every run
np.random.seed(42)
A = np.random.randn(3,4)
print("A = " + str(A))
print("2-norm(A) = " + str(np.linalg.norm(A)))

Running that gives this:

A = [[ 0.49671415 -0.1382643   0.64768854  1.52302986]
 [-0.23415337 -0.23413696  1.57921282  0.76743473]
 [-0.46947439  0.54256004 -0.46341769 -0.46572975]]
2-norm(A) = 2.672810732482017

Now let’s try multiplying by 1/0.8 and see what happens:

B = A * (1/0.8)
print("B = " + str(B))
print("2-norm(B) = " + str(np.linalg.norm(B)))

Running that gives this:

B = [[ 0.62089269 -0.17283038  0.80961067  1.90378732]
 [-0.29269172 -0.2926712   1.97401602  0.95929341]
 [-0.58684298  0.67820005 -0.57927212 -0.58216219]]
2-norm(B) = 3.3410134156025215

It’s easy to prove that ||m* A|| = |m|*||A|| where m is a real scalar and A is a real-valued matrix. If you check with your calculator, you’ll see that is what happened here.

1/0.8 = 1.25
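
Instead of a calculator, the identity can also be checked in code. This is a small sketch that reuses the same seed as above; the names lhs, rhs, lhs_neg and rhs_neg are just illustrative. It also tries a negative scalar to show why the absolute value is needed.

```python
import numpy as np

np.random.seed(42)
A = np.random.randn(3, 4)
m = 1 / 0.8

lhs = np.linalg.norm(m * A)          # norm of the scaled matrix
rhs = abs(m) * np.linalg.norm(A)     # scalar times norm of the original
print(lhs, rhs)

# Same check with a negative scalar: the norm still scales by |m|
lhs_neg = np.linalg.norm(-m * A)
rhs_neg = abs(-m) * np.linalg.norm(A)
print(lhs_neg, rhs_neg)
```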

@paulinpaloalto Thanks a lot, sir, for your answers and your patience with my questions. Here is my takeaway about inverted dropout. Can you please correct me if my understanding is wrong?

  1. The divide operation increases the values, and multiplying by 0.5 reduces the activation values (correct or wrong)

  2. We implement dropout in the training phase, so it weakens the connections between neurons in one layer and the next, which brings out the regularization effect (correct or wrong)

  3. After getting the regularization effect, we need to divide by keep_prob, which slightly increases the magnitude of the activation values. So the expected activation values won’t change at each layer, or at y^hat (correct or wrong)

  4. Suppose we don’t divide by keep_prob at training time and instead compensate by multiplying by keep_prob at test time. Since we don’t divide by keep_prob at training time, the expected activation values are reduced, so at test time we must match these reduced activation values. Multiplying by keep_prob (e.g. 0.5) at test time reduces the test-set activation values, so the training activation values become equal to the test activation values. (correct or wrong)

  5. We can also do the scaling at test time, but then we end up with a more critical and complicated scaling process. (correct or wrong)

  6. Why did earlier versions scale by multiplying by keep_prob at test time? During the training phase, because of dropout, we end up with a smaller network, so each neuron receives less input signal. In the test phase, each neuron receives a larger input signal, which makes the activation values high. Multiplying by keep_prob at test time reduces those activation values so that they become equal to the training-phase activation values. (correct or wrong)

You never do any kind of regularization at test time. It only happens at training time. At test time, you just use the trained network to make predictions. Even if you used dropout during training, you do not use it at all (either the zeroing of neurons or the scaling by 1/keep_prob) at test time. The point is that both of those computations are part of the regularization and that does not happen at test time.

Please realize that the mentors are volunteers here. We are not being paid to do this, so we do not owe you a detailed answer to all possible questions that you can think of. Given your intense level of interest in dropout, perhaps you would find it worth your time to actually read the original paper from Prof Geoff Hinton’s group that introduced and defined dropout.

Dear mentor,
I am also confused about this. After reading the paper you suggested, my understanding is that zeroing out neurons, each one kept with probability keep_prob, is done during training. But scaling by 1/keep_prob is done during the testing phase. This is what is shown in figure 2 of the mentioned paper. Please do tell me if I am missing something.

It is always the case with all forms of regularization (L2, dropout …) that they only happen during training. If they left the factor of 1/keep_prob in the code, then remember that keep_prob will be 1 when you are not actually doing dropout. That’s the way you disable dropout if it is part of the code, so that it doesn’t happen during test. You pass a value < 1 during training, but pass 1 in all other cases.
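
A minimal sketch of why passing keep_prob = 1 switches dropout off. The helper dropout_forward is hypothetical and is not the function from the assignment; it only illustrates the mask-and-scale step.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Hypothetical helper: apply inverted dropout to activations A."""
    # rng.random() returns values in [0, 1), so keep_prob = 1 keeps every unit
    D = rng.random(A.shape) < keep_prob
    return A * D / keep_prob          # dividing by 1 changes nothing

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))       # made-up activations

A_train = dropout_forward(A, 0.8, rng)   # training: some units zeroed, rest scaled up
A_test = dropout_forward(A, 1.0, rng)    # any other time: identical to the input

print(np.array_equal(A_test, A))
```

With keep_prob = 1 the mask is all True and the division is by 1, so the activations pass through unchanged.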

Dear Mentor,
I would like to mention that in the paper they actually use the factor keep_prob (instead of 1/keep_prob) during the test phase. Also, to quote from the paper itself:

For any layer l, r^{(l)} is a vector of independent Bernoulli random variables each of which
has probability p of being 1. This vector is sampled and multiplied element-wise with the
outputs of that layer, y^{(l)}, to create the thinned outputs \tilde{y}^{(l)}. The thinned
outputs are then used as input to the next layer. This process is applied at each layer. This
amounts to sampling a sub-network from a larger network. For learning, the derivatives of the
loss functions are backpropagated through the sub-network. At test time, the weights are scaled as W^{(l)}_{test} = pW^{(l)} as shown in Figure 2. The resulting neural network is used without
dropout.

I read your last reply (thank you for that). But it still doesn’t look like the same strategy as mentioned in the paper. Is the strategy that has been explained in the videos different from the paper?

I have not studied the paper, so I can only guess. Here are a couple of thoughts:

  1. Maybe what they mean by “test time” is different than what Prof Ng meant by test time. I think they might mean just when you make a prediction with the resulting trained network to compute the training accuracy.

  2. The other thought is that I think they are talking at a more theoretical level trying to describe how dropout actually works. Whereas Prof Ng is leading us through actually building it as a practical matter. Notice how the code in the notebook calls predict and predict_dec to make actual predictions for both the training and test data. If you open the file reg_utils.py and examine the code for those routines, you’ll see they both call the “plain vanilla” forward propagation without dropout (or L2 regularization).

Also there’s this item in the “Notes” section of the assignment:

  • A common mistake when using dropout is to use it both in training and testing. You should use dropout (randomly eliminate nodes) only in training.

Dear mentor,
According to the paper

  1. they have a different data set called “test set” which is used to test the trained NN.
  2. Actually this paper gives practical guidelines regarding how to apply the dropout (given in appendix A of the paper). (pardon me if you were not alluding to this but I wanted to mention this as the paper actually acknowledges the difficulty in tuning hyperparameters and gives some guidelines to alleviate the problem)

I completely agree that dropout is done only during training. But the idea behind multiplying the weights by keep_prob during testing is to make the ‘importance’ of the weights similar to what it was during training (since we use the full network for the test set).
In any case, I would, for now, follow the videos for the assignments. But I would try what is mentioned in the paper to satiate my curiosity. If I find something noteworthy then I would inform you.
best regards

Ok, I think I understand what is going on now:

The method in the paper is the very original method. It was the first paper about dropout. The method that Prof Ng shows is just a different and arguably cleaner way to achieve the same result. The reason that in the original they needed to downscale the weights by multiplying by keep_prob at test time (and every other time they use them) is that they did not use the method of upscaling the activations by 1/keep_prob at training time. The method Prof Ng shows is just another way to achieve the same result: if you multiply the activations by 1/keep_prob during training, it upscales the outputs, which in turn causes the learned weights (coefficients) to be downscaled at training time. So that means you don’t need to downscale them later when you use them. Once you’re done with training, the weights are the weights and you don’t need to worry about what the keep_prob value was or even that the training involved dropout. It’s just a simpler method of achieving the same result. I bet if Prof Hinton had thought of that formulation at the time, they would have written it that way in the paper.
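
The equivalence of the two formulations can be sketched for a single linear layer. Everything here is an illustrative assumption: W_inv stands in for weights learned with inverted dropout, W_orig is the same network written in the paper's convention, and the shapes and seed are made up.

```python
import numpy as np

rng = np.random.default_rng(4)
p = 0.8                                 # keep_prob
a = rng.standard_normal((5, 1))         # activations coming out of a layer
W_inv = rng.standard_normal((3, 5))     # stand-in for weights from inverted dropout
W_orig = W_inv / p                      # same network in the paper's convention

d = rng.random(a.shape) < p             # one dropout mask

# Training time: input to the next layer under each convention
z_train_orig = W_orig @ (a * d)         # paper: mask only
z_train_inv = W_inv @ (a * d / p)       # lecture: mask, then divide by keep_prob

# Test time: no mask in either convention
z_test_orig = (p * W_orig) @ a          # paper: weights scaled by p
z_test_inv = W_inv @ a                  # lecture: weights used as they are

print(np.allclose(z_train_orig, z_train_inv), np.allclose(z_test_orig, z_test_inv))
```

Both conventions feed the same values to the next layer at training time and at test time. The only difference is where the factor of keep_prob is absorbed.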

So the strategy is a bit different to get the same results. This makes sense. Thank you for clearing the doubt.

Yeah, it seems the paper uses "w" for Training and "wp" for Testing, while the lecture explains it as "w'/p" for Training and "(w'/p) * p = w'" for Testing. Thanks for your note.