Prof Ng explained both those things in the lectures. If you did not understand what he was saying, I’ll give you my explanation, but then you really should just go back and listen to what he said with this in mind:
For the division by keep_prob, you are compensating for the fact that you “killed” a fraction (1 - keep_prob) of the neurons with dropout. The next layer will see less total “input activation energy” unless you scale up the outputs of the surviving neurons to make up for the percentage that you “killed”. Remember that when we run the trained network to make predictions, there is no dropout at all. So if the following layers are trained on less input, they won’t work as well when they get the full input.
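Here is a minimal sketch of that mask-then-scale pattern in numpy; the names A1 and D1 and the specific shapes are just illustrative, not taken from any particular assignment:

```python
import numpy as np

keep_prob = 0.8  # probability that any given neuron survives

# A1: activations of some hidden layer, shape (units, examples) -- toy values
A1 = np.random.rand(4, 5)

# Random mask: each entry is 1 with probability keep_prob, else 0
D1 = (np.random.rand(*A1.shape) < keep_prob).astype(int)

A1 = A1 * D1          # "kill" the dropped neurons
A1 = A1 / keep_prob   # scale the survivors so the expected total activation is unchanged
```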
On the question of the cost function, the point is that dropout literally changes the architecture of the network. The cost function is a mapping from the inputs to the final scalar cost J, and you change that function on every training iteration when you kill a different random subset of the neurons. Because the dropout is random and different on every iteration, from a mathematical point of view you are evaluating a different function each time, so the cost values are not comparable from one iteration to the next. To be precise, you would compute the cost or accuracy at any point in the training by evaluating with keep_prob = 1 to disable dropout. Of course that’s also how you will be using the network once the training is complete, as I mentioned in the previous paragraph. So that gives you a better (more honest) evaluation of the performance of the network.
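One way to see that keep_prob = 1 really does turn dropout off: with the inverted-dropout scaling above, the mask becomes all ones and the division does nothing. A small self-contained sketch (the helper dropout_layer here is purely for illustration, not a course function):

```python
import numpy as np

def dropout_layer(A, keep_prob):
    """Inverted dropout on activations A; keep_prob = 1.0 disables it."""
    D = (np.random.rand(*A.shape) < keep_prob).astype(int)
    return (A * D) / keep_prob

A = np.random.rand(4, 5)

A_train = dropout_layer(A, keep_prob=0.8)  # training: some units zeroed, the rest scaled up
A_eval  = dropout_layer(A, keep_prob=1.0)  # evaluation: mask is all ones, activations unchanged

assert np.allclose(A_eval, A)  # keep_prob = 1 is genuinely "no dropout"
```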
There have been lots of threads about your first question. E.g. here’s one. If you read all the way through that one, it also discusses the fact that the way Prof Ng has structured this is different from (and better than) the way the compensating scaling was handled in the original Hinton paper that introduced dropout.
But as I said at the beginning, you really should listen to what Prof Ng says if you missed all the above on the first time through. I’m not making this stuff up: everything I said above is just me restating something Prof Ng said in the lectures in my own words.