All of these have been seen in the course except “Standard Dropout”, which is related to “Inverted Dropout”, and I had the idea of just removing input features at random.
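If it helps to make that concrete, here is a minimal sketch of the “remove input features at random” idea; it is just a Keras Dropout layer applied directly to the inputs (the layer sizes are made up for illustration):

```python
import tensorflow as tf

# Minimal sketch: "removing input features at random" is simply dropout
# applied at the input layer. The sizes (20 features, 16 hidden units)
# are made up for illustration.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dropout(0.2),                  # zeroes 20% of input features (training only)
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```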
I thought the name “L2 Regularization” was only used for the weight vector in Logistic Regression, but it’s also used in the general case of weight matrices, as in the subsequent programming assignment.
Is this quite correct?
In fact, Wolfram MathWorld defines the “L2 norm” as the norm of complex-valued vectors.
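For concreteness, here is a small NumPy sketch of that general case: the penalty is the sum of squared entries of every weight matrix (technically the squared Frobenius norm of each matrix), scaled by lambda/(2m). The matrix shapes below are made up.

```python
import numpy as np

def l2_regularization_cost(weight_matrices, lambd, m):
    """Sum of squared entries of every weight matrix, scaled by lambda / (2 * m).

    This is the general-case "L2" penalty for weight matrices
    (each term is the squared Frobenius norm of that matrix).
    """
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weight_matrices)

# Made-up shapes, purely for illustration.
W1 = np.random.randn(4, 3)
W2 = np.random.randn(1, 4)
print(l2_regularization_cost([W1, W2], lambd=0.7, m=100))
```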
Removal of nodes is ALSO regularization (dropping unwanted complexity), though it’s not mentioned in the course.
This would probably need to be followed by complementary training.
Someone must have tried this.
What nodes would one drop though? Maybe the ones that generate the highest-frequency changes in the output when their output changes? Throw an FFT at it.
Or one could try with a population of networks with randomly dropped nodes and throw a genetic algorithm at it. So many possibilities.
You may be interested in “pruning”. For example, as this wiki article says,
The goal of this process is to maintain accuracy of the network while increasing its efficiency.
Therefore, the direction may not be what you were thinking, i.e. to decrease variance. While you can google “neural network pruning” for more references, this article may be interesting to you, too.
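As a rough illustration of what magnitude-based pruning does (my own sketch, not the method of any particular paper), one can zero out the smallest-magnitude weights of a trained matrix:

```python
import numpy as np

def prune_by_magnitude(W, sparsity=0.5):
    """Zero out the fraction `sparsity` of entries of W with the smallest |value|.

    A rough illustration of magnitude-based pruning, not a reference
    implementation of any particular paper.
    """
    threshold = np.quantile(np.abs(W), sparsity)
    mask = np.abs(W) >= threshold
    return W * mask

# Made-up weight matrix, purely for illustration.
W = np.random.randn(5, 5)
print(prune_by_magnitude(W, sparsity=0.5))
```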
Reducing variance by removing trained nodes does not sound like an ideal approach to me. In my opinion, we start from a small network and grow it to reduce bias, and we add regularization to control the variance. So it doesn’t seem natural to suddenly shrink the network for variance reduction, because then we shouldn’t have grown it so large in the first place.
I can’t say this is impossible, because I have not investigated it. However, if people who do a lot of pruning had found that it often improves performance, then I suppose we wouldn’t have just read this:
However, I admit that it could just be because they focused on how to reduce network size while maintaining performance, rather than how to improve performance by reducing size. For example, in that tweet, their choice of weights to prune was based on “lowest magnitudes”, but your choice was “highest-frequency changes in the output”. Different intentions.
While “lowest magnitudes” is quite intuitive for their intention, the problem now is: is there any criterion by which we can say that a certain node is more likely responsible for high variance? I am not sure about this.
Next, I am just sharing some thought process, and it may turn out not to make any sense at all… In a high-variance model, we may see a large change in output when there is a small change in input, so the small change gets amplified across the network. Then what kind of nodes can amplify it? If the input to a node contains only positive values, then a node with a large sum of weights will amplify it more; but if the input can contain any values, then it may be difficult to say. And if the output of a node is always negative, then after ReLU it becomes zero anyway. So the criterion seems to get complicated?
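To make the “small change in input, large change in output” idea concrete, here is a sketch (my own, not from the course) that probes a model’s input-output sensitivity with a gradient; the model and inputs are made up:

```python
import tensorflow as tf

def input_sensitivity(model, x):
    """Norm of d(output)/d(input) per example: a crude probe of how much a
    small input perturbation gets amplified by the network. A sketch only,
    not a proper variance estimate."""
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)
        y = model(x)
    grad = tape.gradient(y, x)          # shape: (batch, n_features)
    return tf.norm(grad, axis=-1)       # one sensitivity value per example

# Made-up model and inputs, purely for illustration.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])
x = tf.random.normal((3, 4))
print(input_sensitivity(model, x))
```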
For which intention? For reducing network size while maintaining performance? Or for reducing variance to improve performance?
Adopting your hill-valley analogy, did you mean that the performance of B > A while allowing the performance of C > B? I was thinking about B > A and B > C.
Then you add nodes until you have something that has low bias.
But then you find it has too much variance.
So you prune nodes again.
And maybe that gives you a good solution with appropriate variance & bias.
But none of that may make any sense, I’m just going by the feels.
Here is another thing that one can try for regularization:
All of this stuff is based on floating-point numbers of up to 64 bits.
What happens if one reduces the allowed range, switching to 16-bit floats maybe (this is also advantageous for avoiding excessively complex hardware and excessive energy expenditure)? What kind of range do we really need?
Maybe the network starts to generate really bad answers, but that in itself would be a sign that one is relying on tiny numerical details (noise) in the computation.
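For what it’s worth, trying 16-bit floats is easy in TensorFlow with the mixed-precision API. This is just a sketch of how to switch it on; whether reduced precision actually acts as a regularizer is exactly the open question here, and the layer sizes are made up:

```python
import tensorflow as tf

# Run most computations in float16 while keeping the variables in float32.
# Whether reduced precision helps as regularization is an open question;
# this only shows how to try it.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(16, activation="relu"),
    # Keep the final output in float32 for numerical stability.
    tf.keras.layers.Dense(1, activation="sigmoid", dtype="float32"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```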
If we think that too much precision causes high variance, then besides using 16-bit floats, we also have the “GaussianNoise” layer, which is treated as a regularization layer in TensorFlow, and we can use it at training time to perturb the input to a layer.
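For example (layer sizes made up), GaussianNoise only perturbs activations during training and passes values through unchanged at inference:

```python
import tensorflow as tf

# GaussianNoise adds zero-mean Gaussian noise to its input at training time
# only; at inference it is a pass-through. Sizes are made up for illustration.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.GaussianNoise(stddev=0.1),     # perturb the network inputs
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.GaussianNoise(stddev=0.1),     # perturb a hidden layer's input
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```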