Hello,
I have a few doubts about the dropout technique:
- Firstly, I did not understand why we divide by keep_prob (as shown in the attached image). What purpose does that serve?
- Secondly, when we apply dropout, why is it difficult to track the cost function graph? Prof Andrew Ng explains that we should run the NN without dropout, check that it is working fine, and only then apply dropout. I did not understand this part; can anyone explain it to me in detail, please?
Any help with this is highly appreciated; thank you in advance…
Prof Ng explained both of those things in the lectures. If you did not understand what he was saying, I’ll give you my explanation, but then you really should go back and listen to what he said with this in mind:
For the division by keep_prob, you are compensating for the fact that you “killed” (1 - keep_prob) of the neurons with dropout. The next layer would otherwise see less total “input activation energy”, so you scale up the outputs of the surviving neurons to make up for the percentage that you “killed”. Remember that when we run the trained network to make predictions, there will be no dropout at all. So if the following layers are trained on less input, then they won’t work as well when they get the full input.
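Here is a minimal numpy sketch of that scaling step (the shapes and variable names are just illustrative, not the assignment’s exact code):

```python
import numpy as np

np.random.seed(1)
keep_prob = 0.8

# Activations of some hidden layer (4 x 5 is just an illustrative shape)
A1 = np.random.randn(4, 5)

D1 = np.random.rand(*A1.shape) < keep_prob   # mask: keep each unit with prob keep_prob
A1_dropped = A1 * D1                         # zero out ("kill") ~ (1 - keep_prob) of units
A1_scaled = A1_dropped / keep_prob           # inverted dropout: scale the survivors back up

# In expectation the total "activation energy" reaching the next layer is unchanged,
# so the following layer sees roughly the same input scale during training as it
# will at prediction time, when there is no dropout at all.
```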
On the question of the cost function, the point is that what dropout does is literally change the architecture of the network. The cost function is a mapping from the inputs to the final scalar cost J, and you literally change that function on each training iteration by killing some percentage of the neurons. The other point is that dropout is random and different on every iteration, so from a mathematical point of view it is a different function on every iteration and the values are not comparable. To be precise, you would compute the cost or accuracy at any point in training by evaluating the network with keep_prob = 1 to disable dropout. Of course that is also how you will use the network once training is complete, as I mentioned in the previous paragraph, so that gives you a better (more honest) evaluation of the performance of the network.
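To illustrate the point, here is a tiny hypothetical example (made-up shapes and parameter values, not the assignment’s network): with dropout enabled, the same parameters give different “cost” values because a different sub-network is sampled each time, whereas keep_prob = 1 gives one well-defined, comparable number:

```python
import numpy as np

def forward(X, W1, b1, W2, b2, keep_prob=1.0):
    """One hidden layer with optional inverted dropout, sigmoid output."""
    A1 = np.maximum(0, W1 @ X + b1)                  # ReLU hidden layer
    if keep_prob < 1.0:
        D1 = np.random.rand(*A1.shape) < keep_prob   # random mask changes on every call
        A1 = (A1 * D1) / keep_prob                   # inverted dropout
    return 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))     # sigmoid output A2

def cost(A2, Y):
    m = Y.shape[1]
    return float(-np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m)

np.random.seed(0)
X = np.random.randn(3, 10)
Y = (np.random.rand(1, 10) > 0.5).astype(float)
W1, b1 = np.random.randn(4, 3) * 0.1, np.zeros((4, 1))
W2, b2 = np.random.randn(1, 4) * 0.1, np.zeros((1, 1))

# Same parameters, but dropout samples a different "architecture" each time,
# so these two training costs are not comparable:
print(cost(forward(X, W1, b1, W2, b2, keep_prob=0.5), Y))
print(cost(forward(X, W1, b1, W2, b2, keep_prob=0.5), Y))

# Monitoring with keep_prob = 1 disables dropout and evaluates the full network:
print(cost(forward(X, W1, b1, W2, b2, keep_prob=1.0), Y))
```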
There have been lots of threads about your first question; e.g. here’s one. If you read all the way through it, it also discusses the fact that the way Prof Ng structures this “inverted dropout” is different from (and better than) the way the scaling was handled in the original Hinton paper that introduced dropout.
But as I said at the beginning, you really should listen to what Prof Ng says if you missed all of the above the first time through. I’m not making this stuff up: everything above is just me restating, in my own words, what Prof Ng said in the lectures.
There are several additional interesting and slightly more subtle points in how they have us build dropout in the assignment: as they usually do here, they set the random seed to a fixed value, so that it’s easy for them to write the test cases and the graders. But that means we’re literally dropping the exact same neurons on every iteration, which is not how dropout really works. Here’s a thread which discusses that point and even includes some experiments.
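A quick sketch of that seeding effect (hypothetical shapes and helper name, not the assignment’s actual code):

```python
import numpy as np

keep_prob = 0.8

def make_mask(reset_seed):
    if reset_seed:
        np.random.seed(1)                    # fixed seed => identical mask every call
    return np.random.rand(4, 5) < keep_prob  # illustrative 4 x 5 layer of units

# Resetting the seed on every iteration (done in the assignment for grading
# reproducibility) drops the exact same neurons on every pass:
fixed = [make_mask(reset_seed=True) for _ in range(3)]
print(all(np.array_equal(fixed[0], m) for m in fixed))      # True

# Seeding once and then letting the masks vary is how dropout is meant to work:
np.random.seed(1)
varying = [make_mask(reset_seed=False) for _ in range(3)]
print(all(np.array_equal(varying[0], m) for m in varying))  # False (almost surely)
```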
Then the other point is: if you are just dropping the same neurons every time, isn’t that equivalent to just defining a smaller network to start with and dispensing with all the dropout business? Maybe, but here’s a thread which discusses that point a bit more.