Hi,
I found these sentences in the videos and text about why we need to divide the activations of each layer by that layer’s keep_prob. For example:
In order to not reduce the expected value of A[L]…
You’re assuring the result of the cost will still have the same expected value as without dropout… DOESN’T THE COST CHANGE IN EACH ITERATION ANYWAY?
Still, I am not 100% sure that I understand it deeply. My understanding is that when keep_prob is 0.8 we zero out 20% of the neuron outputs and then multiply the remaining values by 10/8 to scale them back up. But I am not sure why this scaling is necessary?
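In code, this is roughly what I think the step is doing (just a sketch with made-up shapes and values, not the assignment code):

```python
import numpy as np

np.random.seed(1)
keep_prob = 0.8

A = np.random.rand(5, 4)              # made-up layer activations (units x examples)
D = np.random.rand(5, 4) < keep_prob  # keep each activation with probability 0.8
A = A * D                             # step 1: zero out ~20% of the activations
A = A / keep_prob                     # step 2: scale the survivors by 1/0.8 = 10/8
```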
Thanks
If we remove 20% of the neuron units with keep_prob = 0.8, then the total output from the “remaining” units is only 80% of the original, since 20% of the units do not contribute anything. If we use 5 hidden layers and set keep_prob = 0.8, then the final output is 0.8^5 * (original output) = 0.32768 * (original output). If we add more layers, we lose even more of the output. To avoid this situation, we want to keep the total output of each layer equal to the original one. In this sense, (original output) * 0.8 / 0.8 keeps the total output the same even though some of the units are switched off by dropout.
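Here is a quick numpy sketch of that cumulative shrinkage. It is only illustrative: it ignores the weights and nonlinearities between layers and just tracks the scaling, which is an assumption made for clarity.

```python
import numpy as np

np.random.seed(0)
keep_prob = 0.8

# pretend every unit outputs 1.0 so the scaling is easy to see
A_plain = np.ones((100, 1000))     # plain dropout: drop units, no rescaling
A_inverted = np.ones((100, 1000))  # inverted dropout: drop units, divide by keep_prob

for layer in range(5):             # 5 hidden layers, each applying dropout
    D = np.random.rand(100, 1000) < keep_prob
    A_plain = A_plain * D
    A_inverted = (A_inverted * D) / keep_prob

print(A_plain.mean())     # ~0.8^5 = 0.32768 -> the signal has shrunk
print(A_inverted.mean())  # ~1.0 -> the original scale is preserved
```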
The point is that (as Nobu showed) we are eliminating a certain percentage of the outputs of the layer on each iteration. But once the network is fully trained and we are using it in “prediction” mode, we will not be doing dropout at all. So we want the next and subsequent layers to be trained on the standard amount of “activation energy” that the layer is producing. That is why we compensate for the dropped neurons in that way.
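A small numpy check of that statement (with assumed random activations, not the course code): the average of the inverted-dropout output over many training iterations matches the activation the layer produces at prediction time with no dropout at all.

```python
import numpy as np

np.random.seed(0)
keep_prob = 0.8
A = np.random.rand(10, 1)    # made-up activations of one layer for one example

# prediction time: no dropout, activations are used as-is
A_predict = A

# training time: inverted dropout, averaged over many iterations
A_train_avg = np.mean(
    [A * (np.random.rand(10, 1) < keep_prob) / keep_prob for _ in range(100000)],
    axis=0,
)

print(np.abs(A_train_avg - A_predict).max())  # close to 0: same expected "activation energy"
```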
This point has been discussed before. Here’s a pretty long thread that goes over these points and even illustrates the point about the norm of the output activations with actual examples (but you need to read all the way through, not just the first couple of posts).
Paul, I appreciate your addition. I forgot to describe the difference between training time and prediction time, which is the main reason for this operation.