It’s outlined in the lectures that increasing the size of a network can mitigate bias, but to mitigate variance one of the techniques discussed is dropout regularization, where nodes are randomly dropped from the network. Given how bias is handled, it seems to follow logically that you could deal with variance by reducing your network size, which would seem to have the same effect as dropout regularization and would be much easier to implement.
So my question is, why even bother with dropout regularization when you can achieve the same or similar effect through easier means? Am I missing something?
It is an interesting point. At a high level, what you are saying must be correct. I do not really know the definitive answer, but if people as smart as Geoff Hinton, Yann LeCun and Andrew Ng think that dropout is a useful concept, then there must be more subtleties here. Maybe you would find some comment on this point in the original paper from Prof Hinton’s group which introduces dropout.
Here is my guess about regularization in general: it’s just easier to tune than trying to tune the architecture of your network. If you think about it, there are lots of ways you can change the architecture of your network: you can add or subtract layers and you can add or subtract neurons from any of the various layers. That’s a lot of degrees of freedom and hence a large search space to explore. Maybe it’s simpler just to make your network a bit bigger than you really need (“overkill”) and then just “dial in” a bit of regularization to damp down the overfitting (high variance) that may result. E.g. in the case of L2 regularization, you only have to do a binary search on one hyperparameter, λ. In the case of dropout, it’s a little more complex in that you have to choose which layers to apply dropout to, and you could even use different keep_prob values at different layers. But the point is that you still have fewer “knobs to turn” and maybe that saves work overall. That’s just my guess, not based on any actual knowledge.
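To make the “knobs to turn” comparison concrete, here is a minimal sketch (my own illustration, not from the lectures) of the hyperparameters each approach exposes, using numpy-style names:

```python
import numpy as np

# L2 regularization adds a single hyperparameter (lambda) to the cost:
def l2_penalty(weights, lambd, m):
    """weights: list of weight matrices [W1, W2, ...]; m: number of training examples."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

# Dropout potentially exposes one keep_prob per layer, e.g.:
keep_probs = {1: 1.0, 2: 0.8, 3: 0.8}   # layer index -> keep probability
```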
Or to state my conjecture in the same terms in which you asked the question: maybe tuning the size of the network is not easier. It may be conceptually simpler, but in practice it’s not actually easier to execute. Regularization is actually the easier path to achieve the desired balance between bias and variance.
Actually this seems like an instance in which the famous A. Einstein quote applies: “In theory, theory and practice are the same. In practice, they’re not.”
Model combination nearly always improves the performance of machine learning methods. With large neural networks, however, the obvious idea of averaging the outputs of many separately trained nets is prohibitively expensive. Combining several models is most helpful when the individual models are different from each other and in order to make neural net models different, they should either have different architectures or be trained on different data. Training many different architectures is hard because finding optimal hyperparameters for each architecture is a daunting task and training each large network requires a lot of computation. Moreover, large networks normally require large amounts of training data and there may not be enough data available to train different networks on different subsets of the data. Even if one was able to train many different large networks, using them all at test time is infeasible in applications where it is important to respond quickly.
Dropout is a technique that addresses both these issues.
They are talking about this in slightly different terms than I phrased it in my earlier reply, but I think by saying “training many different architectures is hard” they mean what I was saying about the search space just being too large. If you are going to change the architecture, there are just too many choices. In order to know which ones work, you have to run the training on each possible choice.
Thank you for the paper, I will try to read it when I find the time.
After reading your most recent response, I think I overlooked one detail from the lecture: if I’m not mistaken, dropout is applied not just once but several times, and the results are consolidated at the end? I think the quote in the paper is talking about how infeasible it is to further test all of the different architectures resulting from dropout, but I can see how that might extend to testing different manually chosen architectures as well.
But I guess one thing I’m not quite understanding, and one of the main reasons for my initial question, is that the process of choosing values for dropout itself seems to involve a lot of parameters. You can choose a different keep_prob value for each layer, and this would effectively amount to re-choosing the number of nodes per layer manually (or would approach that effect as you increase the number of nodes per layer, or the number of iterations of dropout). The amount of iteration needed to truly tune the hyperparameters would seemingly be pretty similar to manually reducing the network.
However, dropout seems to induce more non-linearity in the structure of the network than choosing the network manually might. Maybe choosing the network manually wouldn’t give enough resolution in modifying the bias/variance of the model? Especially if, as I said (and again, if I understand correctly), several different dropout architectures are consolidated at the end. It reminds me of a random forest classifier, except in this case it would be more like a random forest where each decision tree is a separate neural network.
To correct myself on the consolidation idea, that’s due to the number of iterations needed to train the model, as with any supervised model, right? And so due to the randomness in the dropout regularization, this results in a different network architecture upon each iteration? This seems like it could be a powerful technique, and I believe Andrew said in the lecture that this effect would prevent any one feature from being overemphasized.
Yes, I think the point you make in the last paragraph is the key one. The power of dropout is that you are actually sampling a different network literally on every training sample and on every iteration. So this has potentially a more powerful effect than just using a smaller network: it is a larger network where the coupling between the different outputs at any given layer is subtly (or not so subtly, depending on your dropout probability) weakened by the stochastic effect of dropout.
And in terms of the complexity, just because you could choose different dropout probabilities at every layer doesn’t mean that you start out that way. The simpler strategy would be to start with uniform probabilities and see if that suffices. Only when it doesn’t would you try the more complex strategy.
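To illustrate the “different network on every iteration” point, here is a tiny sketch (my own, with toy values) showing that a fresh dropout mask is drawn on each pass, so each iteration effectively trains a slightly different sub-network:

```python
import numpy as np

np.random.seed(0)
A3 = np.ones((4, 2))        # toy activations: 4 units, 2 examples
keep_prob = 0.8

# A fresh mask is drawn on every pass, and it also differs per example (column),
# so each iteration samples a slightly different sub-network.
for it in range(2):
    D3 = np.random.rand(*A3.shape) < keep_prob
    print(f"iteration {it}:\n{D3.astype(int)}")
```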
Interesting way to think about it - that dropout regularization in effect is like training different model architectures.
An alternate way I like to think of it is how L1 and L2 regularization differ in the case of regression. L1 regularizes by turning off some features, while L2 keeps everything but reduces their weights; that way the model still retains the flexibility from more features/interactions. I view dropout regularization the same way. It allows us to keep a big network that can model complex relationships and interactions but reduces individual weights. On the other hand, if we made our network smaller, it simply wouldn’t be able to model complex relationships.
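Just to make the contrast concrete, here is a toy sketch (my own illustration) of the two penalty terms for a regression weight vector w:

```python
import numpy as np

w = np.array([0.0, 2.5, -1.2, 0.0, 0.3])   # toy regression weights

l1_penalty = np.sum(np.abs(w))      # L1 term; training with it tends to zero out some weights
l2_penalty = np.sum(np.square(w))   # L2 term; training with it shrinks all weights but keeps them
print(l1_penalty, l2_penalty)
```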
Maybe I have a silly question to ask:
The whole purpose of dropout is to combat overfitting by reducing the network complexity, thus decreasing activation values at each layer. So I don’t understand why the dropped portion has to be compensated for by dividing by the keep_prob factor. If it has to be compensated for, why do we need dropout at all in the first place? If the dropped portion is restored, then why does it still have the expected dropout effect?
Let me try to explain this with an exercise in which we don’t re-scale:
As you know, when you drop a number of units in layer ‘n’, the model is “diluting” the value of that layer’s activations before they are passed on. What would happen if we don’t re-scale? Let’s see:
Let’s say ‘n’ = 3, so we will work on layer L3.
We have the activations of layer 3 in A3.
We apply dropout with keep_prob, which shuts off some of the values in A3 (converts them to zero).
Then we move to calculate Z4 = W4*A3 + b4
For the sake of this explanation, let’s not re-scale A3. So what will happen to Z4? It will be receiving the ‘diluted’ values from A3, right? Let’s say Z4 will be ‘weaker’. So we continue the forward prop with a weaker value, finish the forward prop, then the backprop, and repeat the cycle. So we arrive again at L3 and repeat these steps. The next Z4 will be even weaker than in the previous iteration, and this repeats on every cycle of the training.
So how do we avoid this weakening effect? By re-scaling. When we re-scale A3, we increase the values of the ‘live’ neurons of A3 in proportion to keep_prob:
We do A3 = A3 / keep_prob. Since keep_prob < 1, the effect is that the live values get bumped up. When we do this, the next Z4 will result in a value that is ‘strong’ and not ‘weak’.
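Putting those steps into code, here is a minimal sketch of the layer-3 forward step with the re-scaling included (an illustration using the names above, not the actual assignment code):

```python
import numpy as np

def forward_L3_to_L4(A3, W4, b4, keep_prob):
    D3 = np.random.rand(*A3.shape) < keep_prob   # random dropout mask for layer 3
    A3 = A3 * D3                                 # shut off some units (set them to zero)
    A3 = A3 / keep_prob                          # re-scale so the surviving values are bumped up
    Z4 = np.dot(W4, A3) + b4                     # the next linear step sees activations of normal "strength"
    return Z4, D3
```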
I hope this sheds some light on your question! Please let me know if this is still not clear and I’ll try to help some more.
I think your interpretation of why dropout works is incorrect. It is not the fact that we are reducing the activation outputs that is the point: it is the subtle weakening of the dependence of a given neuron on the specific inputs from the previous layer. The point is (as described in the lectures) that you are sampling a different slightly reduced network on every iteration and on every training sample in the batch. This stochastic effect of weakening the connections is what reduces the overfitting. But note that when we actually apply the trained network to make a prediction, dropout is no longer used: we simply use the trained network. That is true of all forms of regularization: they are only applied during training, not during inference. So if we don’t compensate for the reduced “expected value” of the activations, then the network in inference mode will not work as well because it’s been trained to expect less total activation value but it gets values from all the neurons in inference mode.
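Here is a quick numerical check of the “expected value” point (my own sketch, not from the lectures): with inverted dropout the mean activation during training stays close to the mean the network will see at inference time, when all neurons are active and no scaling is applied.

```python
import numpy as np

np.random.seed(1)
A = np.random.rand(100, 1000)    # toy activations
keep_prob = 0.8

# Training time: inverted dropout keeps the expected total activation roughly unchanged.
D = np.random.rand(*A.shape) < keep_prob
A_train = (A * D) / keep_prob
print(A.mean(), A_train.mean())  # the two means come out very close

# Inference time: no dropout and no scaling; the full network is used as-is,
# so downstream layers see activations on the same scale they were trained with.
```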
Hi @Juan_Olano: From your answer I can understand better: even though the total amount of activation is approximately restored by re-scaling, the dropout effect is still achieved, because the network complexity is reduced by shutting down random units in a layer. So it is roughly like this: in each iteration, each of the remaining neurons individually carries more weight than a unit in the full set of neurons would.
For the second part of my question, it seems @paulinpaloalto’s links provided a clearer explanation. In summary, because of the needs of the testing phase, the activations have to be rescaled to almost the same level as the pre-dropout level; otherwise, the test phase would conceptually operate on a different model. In other words, if we only consider the training phase, there is no need for rescaling, because making dropped layers weaker is the purpose of dropout.
@paulinpaloalto I read through all the linked answers you provided and I think I understand how it works now. The implementation from the original paper does it like this: prediction = A * keep_prob, to roughly remove the same amount of energy that’s dropped at training time, so that the test-time activations match the activations produced by the layers during training. So in the training phase there is no rescaling involved at all, only dropout itself. This way it seems easier to understand how and why dropout works. But the implementation that only involves the training phase probably makes the model easier to use at testing time, and that’s presumably why Prof. Ng taught that way of implementing it.
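To contrast the two conventions being discussed, here is a rough sketch of my understanding (assuming a single keep_prob for simplicity; not code from the paper or the course):

```python
import numpy as np

keep_prob = 0.8

# Convention 1: "inverted dropout" as taught in the course: scale during training,
# so nothing special is needed at test time.
def train_activation_inverted(A):
    D = np.random.rand(*A.shape) < keep_prob
    return (A * D) / keep_prob

# Convention 2: original-paper style: no scaling during training,
# compensate by scaling by keep_prob at test time instead.
def train_activation_paper(A):
    D = np.random.rand(*A.shape) < keep_prob
    return A * D

def test_activation_paper(A):
    return A * keep_prob
```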
That’s a really interesting insight. Prof Ng does mention in the lectures that the cost function is different on every iteration because of the fact that the architecture of the underlying network is literally different. But I had not thought of the effect that would have on the gradient descent process.
But note that this effect is only happening during training: once you finish training, the model is just a normal network and is applied in inference mode without any dropout active. The effect of dropout is that the weights of the final network give better results in terms of the balance of variance and bias on the test data.
But what if, in the backprop step, the weights of the active neurons in L3 were increased so that next time around Z4 actually gets stronger despite the dilution? In other words, if we don’t re-scale, why wouldn’t the neural net “learn” to make the weights bigger during parameter updates in L3, so that next time around Z4 is actually not weaker but stronger?
Also, scaling can be done at test time too, right? Provided we did not dynamically change keep_prob during training? So in this case also, the output signal gets progressively weaker during training, right? But scaling at test time works too.
Lastly, what happens if we don’t do any scaling during training or at test time? Can’t we just run, say, 100 rounds of predictions with dropout at the same rate as in training and average out the result?
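For what it’s worth, here is a minimal sketch of what that averaging idea might look like (purely illustrative; predict_with_dropout is a hypothetical function that runs one forward pass with dropout masks active and no re-scaling):

```python
import numpy as np

def averaged_prediction(X, predict_with_dropout, keep_prob, rounds=100):
    # Run many stochastic forward passes (dropout active, no re-scaling)
    # and average the outputs instead of compensating with keep_prob.
    preds = [predict_with_dropout(X, keep_prob) for _ in range(rounds)]
    return np.mean(preds, axis=0)
```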