Hello there,
I have this question about why we perform rescaling of Z after dropout. What is the actual problem with working with a value that is not the expected value of Z that we would have had without regularization? What is the mathematical and practical issue with not rescaling? I cannot see the problem with this…
It’s a good question that has been asked and answered a number of times before. Please have a look at the posts on that thread, from the linked post to the end, and see if that covers your question.
The short summary is that you need to remember that the dropout only happens at training time. When you actually use the network to make predictions, you just use the trained weights. If you don’t compensate for the dropout during training, then the later layers of the network are trained to expect less “energy” than they actually get from the previous layers when you make predictions.
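To make that concrete, here is a minimal sketch of the inverted-dropout idea in numpy, using the course-style name keep_prob (the function name and the ReLU choice are just illustrative assumptions, not the assignment code): at training time we mask and divide by keep_prob, and at prediction time we do neither, so the later layers see roughly the same scale of inputs in both modes.

```python
# Minimal sketch of inverted dropout for one hidden layer (assumes numpy;
# the function name and ReLU activation are illustrative, not the course code).
import numpy as np

def forward_with_dropout(A_prev, W, b, keep_prob, training=True):
    Z = np.dot(W, A_prev) + b
    A = np.maximum(0, Z)                       # ReLU activation
    if training:
        # Random mask: each neuron is kept with probability keep_prob
        D = (np.random.rand(*A.shape) < keep_prob)
        A = A * D                              # zero out the dropped neurons
        A = A / keep_prob                      # rescale so E[A] matches the no-dropout case
    # At prediction time (training=False): no mask, no scaling, all neurons active
    return A
```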
Sorry if this is becoming a very repetitive topic for you; I can guess it is. But I did not understand the concept of “energy”. For instance, suppose I do not rescale Z during training (the training vs. testing distinction is not an issue for me, and I think I get the purpose of dropout: we are feeding slightly different outputs at each layer of the network so it doesn’t get “stuck” on the same structure of values in either forward propagation or back propagation). The rescaling still bugs me: what is the problem with the network forward propagating and backpropagating with different “energies”? Is the problem that the later layers (the final layers in forward propagation, and the initial layers in backprop) will not be able to learn, since the values can become very small?
I would state the intuition about how dropout works a bit differently than you do. The point is that the neurons that get dropped are different on each iteration, so the effect is to dampen overfitting by weakening specific connections between the outputs at one level and the inputs at the next level. Exactly how strong that weakening effect is depends on the keep_prob value that you use, of course. Maybe that subtlety in the intuition doesn’t really affect the bigger point you are making here, but I thought it was worth stating.
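In case it helps, here is a tiny illustration of that point (my own sketch, not from the assignment): the mask is redrawn on every iteration, so a different subset of connections is weakened each time, and keep_prob controls how aggressive that weakening is.

```python
# The dropout mask is redrawn on each training iteration, so different
# neurons (and hence different connections) are weakened each time.
import numpy as np

np.random.seed(1)
keep_prob = 0.8
hidden_units = 5

for it in range(3):
    D = (np.random.rand(hidden_units, 1) < keep_prob).astype(int)
    print(f"iteration {it}: mask = {D.ravel()}")
# Each iteration generally prints a different 0/1 pattern, so the next layer
# cannot rely too heavily on any single neuron's output.
```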
The problem is not that they can’t learn; the question is what they learn. If you don’t do the reverse scaling, then they potentially learn different things: they learn to react to weaker outputs, because that’s what they are trained on. But then what they have learned may not fit as well with the actual data they see when you run real predictions without the dropout logic in place, because those outputs have more “energy”. Did you read far enough in the thread I linked to see the part about the L2 norms of the outputs? Maybe that was earlier in the thread than the post I linked.
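Here is a quick numerical check of that “energy” point (my own sketch, not code from the thread): without the division by keep_prob, the average activation fed to the next layer shrinks by roughly a factor of keep_prob, so the weighted sums in the next layer are systematically smaller during training than at prediction time; dividing by keep_prob restores the expected magnitude.

```python
# Compare the average activation magnitude with no dropout, with dropout but
# no rescaling, and with inverted dropout (mask then divide by keep_prob).
import numpy as np

np.random.seed(0)
keep_prob = 0.8
A = np.random.rand(100, 1000)                    # some non-negative activations

D = (np.random.rand(*A.shape) < keep_prob)
A_dropped  = A * D                               # dropout, no rescaling
A_rescaled = A_dropped / keep_prob               # inverted dropout

print("mean, no dropout   :", A.mean())          # ~0.50
print("mean, no rescaling :", A_dropped.mean())  # ~0.40, i.e. keep_prob * 0.50
print("mean, with rescale :", A_rescaled.mean()) # ~0.50 again
```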
Maybe you are thinking too hard here. It actually seems like a pretty straightforward argument: you want the training to be closer to what happens in prediction mode. You only want the stochastic weakening of the reaction to particular neuron outputs without a general decrease in the L2 norm of the inputs.
I think I now get your point a bit better… Still not crystal clear, but better. It would be like this: without rescaling, we would be “eliminating data in a way that changes the distribution of the values used for training”, so the model would have been trained without values, or a range of values, that it will have to deal with in the testing phase… I don’t know how much sense this makes, I am sorry…
Just a side note on “what happens in prediction mode”: for dropout to be a “good thing”, and for the “rescaling” to be worthwhile, I then have to ensure that the parameters of the distribution of the test data are the same as those of the training data, right?
The question of the distribution of the training data versus the test and validation data is a separate matter, right? Of course ideally you always want all your data to be from the same distribution, regardless of whether you are doing regularization or not. The normal methodology is to randomly shuffle all your data in a statistically fair way before subdividing it into the three datasets. But life is not always so simple and easy to arrange. Prof Ng will discuss this in quite a bit more depth in Course 3 of this series, so please stay tuned for that.