Dropout Frequency

Hi all,

I have a question regarding the frequency with which dropout is applied during forward and backward propagation. From the lectures, I was under the impression that dropout is applied to every training example separately. In other words, a random selection (pattern) of the network is sampled for every single training example, regardless of whether the examples fall under the same gradient descent iteration.

I want to make sure my understanding of the implementation in programming assignment 2 is correct. To confirm: the implementation still creates a unique dropout configuration for every example, regardless of the gradient descent iteration, right?

Given that every A[l] (the activations of the l-th layer) is an n_l x m matrix (n_l being the number of units in the layer and m the number of examples), does this mean that creating a random array of these dimensions and basing the 0/1 mask on it also creates a unique dropout configuration in that layer for every example? And by repeating this process for each layer's activations and combining the masks in forward/backward propagation, do we indeed get a uniquely random dropout configuration for every example?
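To make the question concrete, here is a minimal NumPy sketch of what I mean by a per-example mask (the variable names `n_l`, `m`, and `keep_prob` are my own, not necessarily those used in the assignment):

```python
import numpy as np

np.random.seed(1)

n_l, m = 4, 3                       # units in layer l, number of examples
keep_prob = 0.8

A = np.random.randn(n_l, m)         # activations of layer l, shape (n_l, m)

# One independent Bernoulli draw per (unit, example) entry:
D = (np.random.rand(n_l, m) < keep_prob).astype(int)

# Inverted dropout: zero out units and rescale to preserve the expected value.
A_dropped = (A * D) / keep_prob

# Each column of D is the dropout pattern for one training example,
# so different examples in the same mini-batch can drop different units.
print(D)
```

Since the mask has the full (n_l, m) shape, each column (i.e. each example) gets its own configuration within a single forward pass.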

Thank you!

Hi, @kzed.

Your understanding seems correct to me. For each training example in a mini-batch, you sample a thinned network by dropping out units (source).

If that were not the case, then instead of an n_l x m matrix you'd sample an n_l x 1 vector and broadcast it along the m dimension, right?
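For contrast, here is a sketch of that hypothetical alternative (one mask per layer, shared across the whole mini-batch via broadcasting); this is NOT what the assignment does, just an illustration of the difference:

```python
import numpy as np

np.random.seed(0)

n_l, m = 4, 3
keep_prob = 0.8

A = np.random.randn(n_l, m)

# Hypothetical shared mask: one Bernoulli draw per unit, not per (unit, example).
D_shared = (np.random.rand(n_l, 1) < keep_prob).astype(int)

# NumPy broadcasts the (n_l, 1) mask along the m dimension,
# so every column (example) has exactly the same units dropped.
A_shared = (A * D_shared) / keep_prob
```

With the (n_l, m) mask from the assignment, the zero pattern can differ from column to column; with this (n_l, 1) version it cannot.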

I hope you’re enjoying the course :slight_smile:
