I’m quite confused about the operation D1 /= keep_prob in dropout regularization.
In my test, the accuracy with this operation is close to the accuracy without it: 92% on the training set and 95% on the test set with it, versus 93% on the training set and 95% on the test set without it.
Is it really necessary?
Just doing one particular test case doesn’t constitute a general proof of anything, right? Perhaps there are cases in which it doesn’t make that much difference. Just out of curiosity, what was the keep_prob value in your experiment?

One assumes that Prof Hinton and his group did more extensive experiments before they published the original paper on dropout. One interesting thing to note is that they accomplished the reverse scaling in a different (equivalent but a lot more inconvenient) way in the original paper. Here’s a thread which discusses that point. Please read from the linked post forward on the thread. There is also a link to the Hinton paper included there.
Here’s another recent thread on the point about why the reverse scaling is useful.
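To make the point about expected values concrete, here is a minimal numpy sketch (the array shape and the keep_prob value are hypothetical, not taken from the assignment) showing what the division does to the average activation:

```python
import numpy as np

np.random.seed(1)
keep_prob = 0.8                        # hypothetical value, just for illustration
A1 = np.random.rand(4, 5)              # stand-in for one layer's activations

# Dropout mask: keep each unit with probability keep_prob
D1 = np.random.rand(*A1.shape) < keep_prob

A1_dropped  = A1 * D1                  # plain dropout: ~20% of units zeroed out
A1_inverted = A1_dropped / keep_prob   # inverted dropout: rescale the survivors

print("original mean:            %.4f" % A1.mean())
print("dropped, no rescale:      %.4f" % A1_dropped.mean())    # roughly keep_prob * original
print("dropped, with /keep_prob: %.4f" % A1_inverted.mean())   # back near the original mean

# The original Hinton et al. formulation instead left the training activations
# unscaled and multiplied the weights by keep_prob at test time: equivalent in
# expectation, but you have to remember to modify the network at test time.
```

Without the division, the inputs each layer sees during training are on average about keep_prob times smaller than they will be at test time (when dropout is switched off), which is exactly the scale mismatch the “inverted” version removes during training.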
Thanks for answering my question. It helps a lot.