Hi all,
I have a question regarding how often a new dropout mask is applied during forward and backward propagation. From the lectures, I was under the impression that dropout is applied to every training example separately. In other words, there is a random selection (pattern) of the network for every single training example, regardless of whether the examples fall under the same gradient descent iteration.
I want to make sure my understanding of the implementation in programming assignment 2 is correct. To confirm: the implementation still creates a unique dropout configuration for every example, regardless of the gradient descent iteration, right?
Given that each A^[l] (the activations of layer l) is an n^[l] x m matrix (n^[l] being the number of units in the layer and m the number of examples), does this mean that drawing a random array of the same dimensions and basing the 0/1 mask on it creates a unique dropout configuration for that layer for every example? And by repeating this process for each layer's activations and applying the masks in forward and backward propagation, we do indeed get an independently random dropout configuration for every example?
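To illustrate what I mean, here is a minimal NumPy sketch of inverted dropout for one layer (the variable names, keep_prob value, and sizes are just illustrative, not the exact assignment code) — since the mask has the same (n^[l], m) shape as A^[l], each column, i.e. each example, gets its own pattern:

```python
import numpy as np

np.random.seed(1)

n1, m = 4, 5                    # hypothetical layer width and batch size
A1 = np.random.randn(n1, m)     # activations of layer 1, one column per example
keep_prob = 0.8

# The mask D1 has the same (n1, m) shape as A1, so every column
# (every training example in the batch) receives its own 0/1 pattern.
D1 = (np.random.rand(n1, m) < keep_prob).astype(int)

# Inverted dropout: zero out units, then rescale to preserve the expected value.
A1_dropped = (A1 * D1) / keep_prob

print(D1)   # each column is a different dropout configuration
```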
Thank you!