Is there any advantage to training a model for, say, 10,000 iterations before enabling dropout regularization?

I’m imagining a network that is trained to a reasonably good fit over iterations 1–10,000, and then for iterations 10,000–50,000 uses dropout regularization to avoid reliance on any given neuron.

It’s an interesting thought which had never occurred to me before. My initial reaction was the same as Tom’s: in everything we’ve seen in these courses, the forms of regularization (L1, L2, or dropout) are applied from the very first iteration of training. Or to state it more fully, we only apply regularization after we run the training without it and discover that we have an overfitting (high variance) problem. Then we start the training again from scratch with regularization applied, and of course we may have to try several times to tune the λ value (for L2) or the “keep prob” for dropout.

But maybe you are on to something. One possible intuition is that overfitting doesn’t happen from iteration 0: it takes a while to get there, so the training might run faster overall (fewer total iterations) if you start without regularization and then “dial it in” at the point where training and test accuracy start to diverge. You could even make the dropout rate “adaptive”, based on the comparative behavior of the training and test accuracy values as training progresses. It’s just an intuition, but this is an experimental science: you could try it and see what happens. Maybe your idea actually works, and a year from now you’ll have published the paper and everyone will be doing Pan Adaptive Dropout!
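Just to make the “late-enabled dropout” idea concrete, here is a minimal NumPy sketch of a hidden-layer forward step with inverted dropout that only switches on after a chosen iteration count. The function name, the `dropout_start` threshold, and the layer shapes are all illustrative assumptions, not an established recipe:

```python
import numpy as np

def forward_hidden(a_prev, W, b, iteration, keep_prob=0.8,
                   dropout_start=10_000, rng=np.random.default_rng(0)):
    """One hidden-layer forward step (ReLU) with inverted dropout that
    only activates once `iteration` reaches `dropout_start`.
    All names here are illustrative, not from any particular course code."""
    z = W @ a_prev + b
    a = np.maximum(0, z)              # ReLU activation
    if iteration >= dropout_start:
        mask = rng.random(a.shape) < keep_prob
        a = a * mask / keep_prob      # inverted dropout: rescale so E[a] is unchanged
    return a

# Before dropout_start the layer behaves exactly like plain training:
a_prev = np.ones((4, 1))
W = np.ones((3, 4))
b = np.zeros((3, 1))
out_early = forward_hidden(a_prev, W, b, iteration=0)       # no dropout yet
out_late = forward_hidden(a_prev, W, b, iteration=20_000)   # dropout active
```

The only change relative to standard dropout is the `iteration >= dropout_start` guard; everything before that point is ordinary unregularized training.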

The other approach to investigate would be to search for any pre-existing papers on adaptive dropout or adaptive regularization in general, or to look at how regularization is handled in the TensorFlow or PyTorch code. Those packages are all open source, I believe, but I have not actually tried looking at the source.

Actually, I was thinking that dropout-regularized training would be faster than unregularized training. The networks are effectively smaller, since a fraction 1 - keep_prob of each hidden layer’s units is zeroed out, are they not?

My understanding of the intuition for dropout regularization is that it prevents the network from over-relying on, or attaching high weight to, any given neuron, which yields a lower-variance network.
I was thinking that this probably isn’t strictly desirable in all cases: sometimes you may want some neurons to carry high weights/biases and be especially important. Enabling dropout late would allow some neurons to grow in importance before the regularizing effect is applied. Potentially this could lead to weight and bias matrices that look very different from ones trained with dropout from the start. It seems you could use this to balance a few high-weight/high-bias neurons against a more homogeneous network.
I’m still learning the math, but do you think this would be the behavior of late-enabled/adaptive dropout regularization?

To make it adaptive based on comparative behavior, would you need to store every weight and bias matrix at each iteration of training, with that cache enabling you to go back in time and choose the point along the gradient descent trajectory at which to enable adaptive dropout? Is storing the weight and bias matrices during gradient descent standard practice?

Well, only logically smaller, right? The matrices you are multiplying are the same size as they would be without dropout; it’s just that (on average) a fraction 1 - keep_prob of the elements happen to be zeros. That doesn’t save you any real compute time in a vectorized computation unless the zeros predominate and you can use “sparse” operations, and my guess is that situation would not really arise.
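A quick NumPy check makes the “only logically smaller” point concrete: the masked activation matrix has exactly the same shape as the unmasked one, and the fraction of zeros is only about 1 - keep_prob, far short of the sparsity that sparse kernels need. The shapes and keep_prob value are arbitrary choices for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.8

a = rng.standard_normal((1000, 64))   # activations of a hypothetical hidden layer
mask = rng.random(a.shape) < keep_prob
a_dropped = a * mask / keep_prob      # inverted dropout

# Same dense matrix, same size -- the multiply downstream costs the same.
assert a_dropped.shape == a.shape

zero_frac = float(np.mean(a_dropped == 0))
print(f"fraction zeroed: {zero_frac:.2f}")   # close to 1 - keep_prob = 0.20
```

With only ~20% of entries zeroed, a dense matmul ignores the zeros entirely, so there is no wall-clock saving.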

But you still have the base cost function driving the weight and bias values toward lower cost on the training data, so that doesn’t mean the magnitudes of all the values are necessarily reduced to similar levels. I think we’re saying the same thing there, though. You have a theory, which you could test: if it gives better results, that would be interesting and relevant.

It is not normal to store the weight values at every iteration; what would be the point? The normal practice is to checkpoint the trained values periodically but keep only the latest checkpoint. That allows you to restart your training from that point if your process gets killed by Colab or whatever. For the adaptive behavior, I was thinking of just monitoring the accuracy and cost values and then dynamically turning on or adjusting the regularization based on those, rather than rewinding to an earlier point. But I should say that this is also just a theory on my side: I don’t have any actual experience trying things like that. Just some ideas or suggestions, which are probably worth what you paid for them.
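The monitoring idea above could look something like the following sketch: track recent training and validation accuracy, and flip dropout on once their moving-average gap exceeds a threshold. The function name, window size, and threshold are all assumptions I’m making up for illustration, not an established heuristic:

```python
import numpy as np

def should_enable_dropout(train_acc_hist, val_acc_hist,
                          gap_threshold=0.05, window=5):
    """Illustrative heuristic: report True once the moving-average gap
    between training and validation accuracy exceeds `gap_threshold`.
    The threshold and window are arbitrary assumptions for this sketch."""
    if len(train_acc_hist) < window or len(val_acc_hist) < window:
        return False  # not enough history to judge yet
    gap = np.mean(train_acc_hist[-window:]) - np.mean(val_acc_hist[-window:])
    return bool(gap > gap_threshold)

# In a training loop you would append accuracies each evaluation step and
# switch the dropout layers on (or raise their rate) when this returns True.
```

Only two scalar histories need to be kept, which is consistent with the point that storing full weight matrices per iteration is unnecessary.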