I am a student.
I need to know whether I am right about dropout, mainly about the activations.
I watched Andrew's explanation three or four times and read a post in the discussions about my doubt.
So am I right when I say:
we scale up a3 by dividing it by keep_prob (0.8), so as to get a similar activation "energy" to when we don't drop anything out.
Let's say, for example A, when we randomly drop some neurons and we don't scale the surviving neurons up, the activation comes out to be 8.
But when we TEST the model on this example, we feed it in using all the neurons, so the activation value might go well beyond 8 (say 13) because of the extra neurons.
So, to solve this issue (the training activation (8) is far less than the testing activation (13) for the same example), we scale the surviving neurons UP.
8 and 13 are just example values for my question.
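To check that intuition numerically, here is a minimal NumPy sketch (the all-ones layer and keep_prob = 0.8 are just illustrative values, not taken from the course code): without rescaling, the total activation shrinks to roughly keep_prob of its full-network value; dividing the survivors by keep_prob brings it back in line.

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.8

# Illustrative pre-dropout activations for one layer (all ones for clarity).
a = np.ones(10_000)

# Randomly "zap" ~20% of the neurons, as dropout does at training time.
mask = rng.random(a.shape) < keep_prob
dropped = a * mask            # no scaling: total activation shrinks
scaled = dropped / keep_prob  # inverted dropout: scale survivors up

print(a.sum())        # 10000.0 -- full network, as at test time
print(dropped.sum())  # roughly 8000 -- like the "8" in the example above
print(scaled.sum())   # roughly 10000 -- back in line with the test-time value
```

So the "8 vs. 13" gap in the example above is exactly what the division by keep_prob is meant to close, in expectation.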
Note that any kind of regularization (dropout, L2 or any other) happens only at training time, not test time. The “reverse scaling” that we do when dropout is happening is as you say: it is to scale up the outputs that are not “zapped” by dropout so that the subsequent layers get roughly the same amount of “energy” from the dropout layer. Then at test time, we use all the trained neurons and neither dropout nor the reverse scaling happens.
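That training/test asymmetry can be sketched in a few lines of NumPy (the function name and signature here are my own illustration, not the assignment's code): at training time the layer masks and rescales; at test time it is the identity.

```python
import numpy as np

def dropout_forward(a, keep_prob, training, rng):
    """Inverted dropout: mask and rescale at training time, identity at test time."""
    if not training:
        # Test time: all neurons active, no mask and no reverse scaling.
        return a
    mask = rng.random(a.shape) < keep_prob
    # Scale survivors up so the expected activation matches the test-time value.
    return (a * mask) / keep_prob

rng = np.random.default_rng(1)
a3 = np.ones(5)
print(dropout_forward(a3, 0.8, training=False, rng=rng))  # unchanged: [1. 1. 1. 1. 1.]
print(dropout_forward(a3, 0.8, training=True, rng=rng))   # each entry is 0.0 or 1.25
```

Note that each surviving entry becomes 1/keep_prob = 1.25, which is the "reverse scaling" described above.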
If what I said above doesn’t answer your questions, here’s a thread from a while back with more detailed discussion on the scaling issues for dropout.
I had read that thread earlier; it was only after reading it that I asked.
I understand that dropout, or any other regularization, is not used at testing time.
My only question is: am I right with my logic that,
with the same NN architecture, the activation for a given example comes out to some value,
but with zapped neurons the activation might change significantly FOR THE SAME EXAMPLE,
so to compensate for this we bump up the rest of the neurons?
My apologies if I wasn't clear; I used the testing and training terms just to convey that at test time we use the whole NN architecture, while with dropout we use a smaller one.