Please explain why we divide a3 by keep_prob in this slide.
If possible, please explain this with an example.
Also, when we implement inverted dropout as in the slide, if we think about the 3rd layer individually for each training example, the number of units dropped out might be different, resulting in a different neural network for each training example.
I think this might be a problem, since we have to end up with a single neural network eventually.
Please explain if this is wrong.
Prof Ng discusses that point in the lecture. If you missed that, I suggest you rewind and watch it again. You can use the interactive transcript to find the relevant part of the lecture.
Here’s a previous thread about this point as well. You can read from that post forward through the thread. Interestingly, in the original paper by Geoff Hinton’s group, they don’t do it that way, and it makes things quite a bit more complicated.
Here’s another thread about it.
And here’s one that actually shows the effect on the L2-norm of the activation output.
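To sketch the basic idea with a toy example (the names a3 and keep_prob and the shapes here are just made up for illustration, not taken from the assignment): dividing by keep_prob rescales the surviving units so that the expected value of a3 stays roughly the same as it would be without dropout, which keeps the scale of the next layer's input from shrinking.

```python
import numpy as np

np.random.seed(0)
keep_prob = 0.8                               # probability of keeping each unit
a3 = np.random.rand(5, 1000)                  # hypothetical layer-3 activations

d3 = np.random.rand(*a3.shape) < keep_prob    # dropout mask
dropped  = a3 * d3                            # units zeroed, no compensation
inverted = (a3 * d3) / keep_prob              # inverted dropout

print("mean of a3 (no dropout): ", a3.mean())
print("mean after dropout only: ", dropped.mean())   # shrinks by roughly keep_prob
print("mean after / keep_prob:  ", inverted.mean())  # back near the original
```

The exact numbers will vary with the random seed, but the rescaled mean should land close to the value without dropout, whereas the unscaled version shrinks by roughly a factor of keep_prob.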
You’re right to observe that not every sample within a given batch (or minibatch) is treated the same in any given iteration. Here’s a thread that talks about that point; if you read all the way through, it also shows some experiments demonstrating that it doesn’t make much difference if you treat all the samples in the batch the same.
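To make that second point concrete, here is another small sketch (again with hypothetical shapes): the mask d3 has one column per training example, so each example in the batch does get its own pattern of dropped units on that iteration, just as you observed; the thread above discusses why that turns out not to matter much in practice.

```python
import numpy as np

np.random.seed(1)
keep_prob = 0.8
m = 4                                     # hypothetical number of examples in the batch
a3 = np.random.rand(3, m)                 # 3 hidden units, one column per example

d3 = np.random.rand(*a3.shape) < keep_prob
print(d3.astype(int))                     # columns generally differ: each example
                                          # sees a differently "thinned" network

a3 = (a3 * d3) / keep_prob                # inverted dropout applied per example
```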