Hello, I have a question about the dropout implementation. When we implement dropout, we rescale A[l] by dividing by keep_prob so that the expected output stays the same. I do not quite understand this, because dropout is applied on every iteration, and in each iteration the realized mask differs from its expected value (e.g., with keep_prob = 0.5 and 3 neurons, it is possible to keep all of them in one iteration, yet we multiply A[l] by 2 anyway, which inflates the output).
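For concreteness, here is a minimal NumPy sketch of the inverted dropout step I am asking about (A, D, and keep_prob follow the notation from the lectures; the layer shape and values are just for illustration):

```python
import numpy as np

np.random.seed(0)                         # reproducible example
keep_prob = 0.5
A = np.random.randn(3, 10)                # activations A[l]: 3 neurons, 10 examples

D = np.random.rand(*A.shape) < keep_prob  # dropout mask D[l]
A = A * D                                 # zero out the dropped neurons
A = A / keep_prob                         # rescale by the EXPECTED keep fraction
# E[D / keep_prob] = 1 elementwise, so the expectation of A is unchanged,
# but in any single iteration the realized mask can keep more or fewer
# than keep_prob of the neurons.
```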
Why don’t we rescale A[l] by the realized fraction of kept neurons in each iteration instead? We can easily keep track of how many neurons were actually kept by looking at the D[l] matrix, for example like this (realized_keep is just a name I made up for this post):
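```python
import numpy as np

np.random.seed(0)
keep_prob = 0.5
A = np.random.randn(3, 10)

D = np.random.rand(*A.shape) < keep_prob
realized_keep = D.mean()      # actual fraction kept THIS iteration, read off D[l]
A = A * D
if realized_keep > 0:         # guard: the mask could in principle drop every neuron
    A = A / realized_keep     # rescale by the realized fraction, not keep_prob
```

Thank you.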