First of all, thank you for working with us to help us understand this material in depth.
By the way, the following equation looks a little strange to me, so let me explain it the way I have understood it. I think it is useful for knowing the number of outputs passed to the following layer. The equation helps us keep the eliminated values as 0 so that we get the expected results. For instance, if we expect 50 units, of which 10 are 0, we have 40 valued units and 10 zeros going into the following layer. Without the equation, we would have only 40 valued elements, and there could be some programming difficulties in managing the ranges.
a3 /= keep-prob
I would appreciate it if you could verify my understanding.
Thanks in advance.
Hi @S.hejazinezhad ,
Let’s try to understand a3 /= keep-prob together. I’ll use my own words to try to help you gain intuition on this:
We know that with Dropout we shut down some nodes in the layer. For example, in a 50-unit layer with keep-prob = 0.8 (each unit is kept with probability 80%), we shut down about 10 units on average and keep about 40. It is important to understand here that we don’t physically remove the shut-down units, but instead we set them to zero.
Next, we know that once Dropout has been applied, we then calculate Z = W * a + b. In the case of a3, we would be calculating Z4 = W4 * a3 + b4, right?
Remember that we have shut off 10 units on a3, so their contribution is zero.
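To make the "set to zero, not removed" point concrete, here is a minimal NumPy sketch. The names d3 and a3_dropped, the random activations, and the seed are my own assumptions for illustration, not code from the course notebook.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the sketch is repeatable
keep_prob = 0.8                         # called keep-prob in this thread

a3 = rng.random((50, 1)) + 0.1          # made-up activations of a 50-unit layer, all > 0
d3 = rng.random(a3.shape) < keep_prob   # boolean mask: True = keep, False = shut down
a3_dropped = a3 * d3                    # shut-down units become 0; nothing is removed

print(a3_dropped.shape)                 # the layer still has 50 entries
print(int(d3.sum()), "units kept out of", a3.size)
```

Note that the number of kept units varies from run to run: it is about 40 on average, not exactly 40 every time.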
And here comes a key question:
What will happen to Z4 if we use a3 with just “80% of its power”? It will certainly reduce the expected value of Z4, right?
How can we solve this?
Well, if we divide a3 by keep-prob, meaning, a3 /= keep-prob, then that will “give more power” to the 80% of active units, right?
Think about this: what happens when you divide 1 by 0.8? 1/0.8 = 1.25 … it is bumped up!
So when we do a3 /= keep-prob, or in the example, a3 /= 0.8, we are basically bumping up all the active units to ‘compensate’ for the missing units.
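Here is a small numerical check of that compensation. It is only a sketch under simplifying assumptions of mine: every activation in a3 is 1, W4 is a single row of ones (so Z4 just sums a3), and b4 is 0.

```python
import numpy as np

rng = np.random.default_rng(1)
keep_prob = 0.8

a3 = np.ones((50, 10_000))   # 50 units, 10,000 made-up examples, every activation = 1
W4 = np.ones((1, 50))        # one output unit that simply sums a3
b4 = 0.0

d3 = rng.random(a3.shape) < keep_prob           # dropout mask, one entry per activation
z4_full = W4 @ a3 + b4                          # no dropout: exactly 50 everywhere
z4_dropped = W4 @ (a3 * d3) + b4                # dropout without the division
z4_inverted = W4 @ (a3 * d3 / keep_prob) + b4   # dropout followed by a3 /= keep-prob

print(z4_full.mean(), z4_dropped.mean(), z4_inverted.mean())
```

Without the division, the average Z4 drops to roughly 80% of its original value (about 40 instead of 50). With the division, the average comes back to about 50, even though each individual example still sees a noisy value.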
I hope this explanation gives you some intuition on the reason why we apply ‘inverted dropout’.
Thanks for your feedback. Well explained!
I understand now. My issue was about physically removing the values. Now it is clear.
By the way, something is still not fully clear to me. When we perform the division, we bump up the remaining 80% of the weights, as you said. But won’t this cause a problem when tuning the weights?
Let me state my doubt another way. Suppose that we have an overfitting problem in the last layer and we know we should use regularization. If we do this, we will get a smoother fit. In this lecture, it seems that we re-tune the already well-tuned weights in addition to zeroing some weights. Isn’t this a problem?
Hi @S.hejazinezhad , good question… so let me try to clarify this point:
First of all, I’d like to clarify that we cannot know that ‘x’ layer is overfitting. The overfitting is determined at the model level, not at a specific-layer level. Now, it is true that we can selectively apply dropout on a per-layer basis. As Prof. Ng shared, we may want to apply stronger dropout on large layers (say, a keep-prob of 0.5), and weaker dropout on smaller layers (say, a keep-prob of 0.8 or higher).
Second, regarding your question specifically, a couple of key aspects to understand:
When we apply Dropout, we are shutting down some units ‘temporarily’. It is not like we are dropping units completely from the model, but just during a cycle. And on each cycle, we drop out a different random set of units, so it is not like the same units stay shut off.
Every time we drop units, we are forcing the model to not depend on a specific configuration of nodes, so this makes the model more robust, and this is what helps with overfitting.
And the key to your question: The whole cycle (forward propagation and back propagation) is performed with the dropped out units. In other words, on each cycle we drop some units, then we run forward prop on the active units, and then we run backprop on the active units. So we are not erasing previous learning of the shut off units. In other words, we are not zeroing weights.
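A tiny sketch may help show this. Everything in it is an assumption made for illustration: a network with one hidden layer of 5 units, a single made-up training example, a squared-error loss, and names such as d1 and W1_before. It is not the course's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
keep_prob, lr = 0.8, 0.1

x = rng.standard_normal((3, 1))          # one made-up training example
y = np.array([[1.0]])
W1, b1 = rng.standard_normal((5, 3)), np.zeros((5, 1))
W2, b2 = rng.standard_normal((1, 5)), np.zeros((1, 1))

for step in range(3):
    # Forward prop with a freshly drawn mask on the hidden layer
    z1 = W1 @ x + b1
    d1 = rng.random(z1.shape) < keep_prob
    a1 = np.maximum(z1, 0) * d1 / keep_prob
    z2 = W2 @ a1 + b2

    # Backprop for a squared-error loss, reusing the same mask and scaling
    dz2 = z2 - y
    dW2 = dz2 @ a1.T
    da1 = (W2.T @ dz2) * d1 / keep_prob
    dz1 = da1 * (z1 > 0)
    dW1 = dz1 @ x.T

    W1_before, W2_before = W1.copy(), W2.copy()
    W1 -= lr * dW1; b1 -= lr * dz1
    W2 -= lr * dW2; b2 -= lr * dz2

    dropped = ~d1[:, 0]
    print("step", step, "dropped units:", np.flatnonzero(dropped))
    # Weights attached to dropped units are left unchanged in this step, not zeroed
    assert np.array_equal(W1[dropped], W1_before[dropped])
    assert np.array_equal(W2[:, dropped], W2_before[:, dropped])
```

With a single example, the weights going into and out of a dropped unit receive a zero gradient for that step, so they keep whatever they had learned before. With a mini-batch, the mask is drawn per example, so a unit dropped for one example can still receive updates from the others.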
Does this answer your question?
In addition to Juan’s excellent explanations here, maybe one other point worth mentioning is that dropout, like all forms of regularization, is applied only at training time. That means when we actually apply the trained model to make a prediction on test or real input data, there is no dropout happening: we use all the trained weights. The point is what Juan said earlier: the dropout effect during training causes the weights that are actually learned to be more “robust” or well-balanced and not as likely to overfit to the training data.
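As a sketch of the training-time versus test-time difference, here is a hypothetical helper (forward_layer and its training flag are names I made up, not a library or course API) that applies inverted dropout only when training is True.

```python
import numpy as np

def forward_layer(a_prev, W, b, keep_prob=1.0, training=False, rng=None):
    """ReLU layer with optional inverted dropout (illustrative helper)."""
    a = np.maximum(W @ a_prev + b, 0)
    if training and keep_prob < 1.0:
        if rng is None:
            rng = np.random.default_rng()
        mask = rng.random(a.shape) < keep_prob   # drawn only during training
        a = a * mask / keep_prob                 # inverted dropout scaling
    return a

rng = np.random.default_rng(3)
W, b = rng.standard_normal((4, 3)), np.zeros((4, 1))
x = rng.standard_normal((3, 2))                  # two made-up input examples

a_train = forward_layer(x, W, b, keep_prob=0.8, training=True, rng=rng)
a_test_1 = forward_layer(x, W, b, keep_prob=0.8, training=False)
a_test_2 = forward_layer(x, W, b, keep_prob=0.8, training=False)

print(np.array_equal(a_test_1, a_test_2))        # True: no randomness at test time
```

Because the division by keep-prob already happens during training, the test-time path needs no extra rescaling: it simply uses all the units and all the trained weights.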
Sorry for the late reply. I would like to thank both of you.
Totally understood. We are neither softly zeroing the weights nor physically removing them. We just apply dropout at training time so that the model does not depend strictly on particular neurons.
What a brilliant method!
It’s great that the discussion was useful. Thanks for confirming!
Another followup question that has come up before is whether the fact that dropout works in a given case means that there is actually a smaller network that we could have started with and trained without dropout that would have achieved the same “Goldilocks” balance between fitting the test data and not overfitting the training data. I don’t definitively know the answer, but it seems likely that this is true. The problem is that finding it is not as practical as using dropout or other forms of regularization. Here’s a thread from a while ago that discusses this point in more detail.
Thanks a lot. It would be useful to have a look at that.