When using inverted dropout for regularization, we already drop out and scale activations in the forward pass of every gradient descent iteration. Why do we need to explicitly set the switched-off dAs to 0 and then scale the remaining dAs in the backward pass? Shouldn’t that happen automatically as a result of the dropout in the forward pass?

My logic:

If we implement inverted dropout on AL, then shouldn’t dAL already have the corresponding dropped elements zeroed and the remaining elements scaled up? After all, dAL is the gradient of AL.

Would be great if somebody could clarify this please!

Hey! In forward prop, neurons are indeed zeroed out. But it is important to consider how this is implemented. We zero out the effect of the neurons not by actually removing them, but by multiplying the activations elementwise by a matrix of the same dimensions that contains some 0s. This makes some activations 0, as if those neurons had been turned off. Since we didn’t really change the network architecture and those neurons still exist, in backward propagation we once again have to explicitly zero out the gradients for the neurons that were turned off in the forward pass.

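d[L] = np.random.rand(a[L].shape[0], a[L].shape[1]) < keep_prob  # d[L] is a mask of 0s and 1s (False/True): each entry is 1 with probability keep_prob. Note rand (uniform on [0, 1)), not randn.

a[L] = np.multiply(a[L], d[L])  # multiply the activations by d[L]; entries that become 0 behave as if those neurons had been turned off

a[L] = a[L] / keep_prob  # inverted dropout: scale the surviving activations so their expected value is unchanged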

It is natural to think that if we turned off neurons, they should stay turned off for backprop. The key is to understand that we aren’t actually eliminating them or turning them off; we only simulate that effect by multiplying by a matrix with some random 0s based on keep_prob.

The reason we simulate turning the neurons off by multiplying, rather than truly removing them, is that at test time we need all neurons to be turned on.
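To make this concrete, here is a minimal sketch of the two steps, assuming NumPy. The helper names dropout_forward and dropout_backward and the use of np.random.default_rng are my own choices for illustration, not the course's code. The point is that the backward step reuses the exact mask that was sampled in the forward step.

```python
import numpy as np

def dropout_forward(a, keep_prob, rng):
    """Inverted dropout: zero random units, scale survivors by 1/keep_prob."""
    mask = rng.random(a.shape) < keep_prob   # True where the unit is kept
    a_dropped = a * mask / keep_prob
    return a_dropped, mask                   # the mask is cached for backprop

def dropout_backward(da_dropped, mask, keep_prob):
    """Apply the same mask and the same scaling to the incoming gradient."""
    return da_dropped * mask / keep_prob

rng = np.random.default_rng(0)               # hypothetical seed, for repeatability
a = np.ones((4, 3))                          # toy activations: 4 units, 3 examples
a_dropped, mask = dropout_forward(a, keep_prob=0.8, rng=rng)

da_dropped = np.full_like(a, 2.0)            # toy gradient arriving from the next layer
da = dropout_backward(da_dropped, mask, keep_prob=0.8)
```

Here da is 0 wherever mask is False and 2.0 / 0.8 everywhere else, which matches what was done to a in the forward pass.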


Thanks for the prompt reply!

I think I understand it now. Even though Al may be set to zero, that doesn’t mean dAl will be zero as a result. The value of dAl depends on the layers to its right (as calculated via the chain rule during backprop) and will not work out to be zero just because Al has been set to 0. Hence, to simulate the effect of switching off the neuron, we have to explicitly set dAl to zero.

Is that the correct understanding?

You’re absolutely right!
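If you want to check this numerically, here is a small sketch, assuming NumPy. The toy shapes, the seed, and the loss (just the sum of the next layer's linear output) are assumptions picked to keep it short. It shows that the gradient arriving from the next layer is nonzero even at dropped positions, and that only after applying the mask and scaling does it agree with a finite-difference gradient.

```python
import numpy as np

rng = np.random.default_rng(1)               # hypothetical seed
keep_prob = 0.5

a1 = rng.standard_normal((4, 5))             # layer-1 activations: 4 units, 5 examples
w2 = rng.standard_normal((3, 4))             # layer-2 weights
mask = rng.random(a1.shape) < keep_prob      # dropout mask, fixed for this iteration

def loss(a):
    a_dropped = a * mask / keep_prob         # inverted dropout in the forward pass
    return np.sum(w2 @ a_dropped)            # toy loss: sum of layer-2 linear outputs

dz2 = np.ones((3, 5))                        # d(loss)/d(z2) for this toy loss
da1_dropped = w2.T @ dz2                     # chain rule: does not "know" about the mask
da1 = da1_dropped * mask / keep_prob         # explicit mask and scale in the backward pass

# Finite-difference gradient of the loss w.r.t. the pre-dropout activations.
eps = 1e-6
numeric = np.zeros_like(a1)
for i in range(a1.shape[0]):
    for j in range(a1.shape[1]):
        bumped = a1.copy()
        bumped[i, j] += eps
        numeric[i, j] = (loss(bumped) - loss(a1)) / eps
```

Comparing numeric with da1 (for example with np.allclose and a small tolerance) should show agreement, while da1_dropped on its own is nonzero at the dropped positions.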

Great, thanks for clearing that up!