Complementary mindmap: various possibilities for regularization

Nothing particularly special, really.

All of these have been covered in the course except “Standard Dropout”, which is related to the “Inverted Dropout” seen in class; I also had the idea of just removing input features at random.
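As an aside, here is a minimal NumPy sketch (my own illustration, not course code) of that last idea: randomly zeroing input features. With the rescaling shown, it is essentially inverted dropout applied to the input layer; standard dropout would skip the rescaling during training and instead scale by the keep probability at test time.

```python
import numpy as np

def drop_inputs(X, keep_prob=0.8, rng=None):
    """Randomly zero input features; rescale so the expected value is unchanged."""
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(X.shape) < keep_prob  # keep each feature with prob keep_prob
    return X * mask / keep_prob             # inverted-dropout rescaling

X = np.arange(12, dtype=float).reshape(3, 4)  # toy batch: 3 examples, 4 features
print(drop_inputs(X, keep_prob=0.75))
```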

Also: the GraphML file associated with the image

2025-03-01: Updated image


Thank you for sharing the map!


Thank you Raymond.

Here is a question though:

I thought the name “L2 Regularization” was only used for the weight vector in Logistic Regression, but it’s also used in the general case of weight matrices. This happens in the subsequent programming assignment.

Is this quite correct?

In fact, the “L2 norm” is defined as a norm on complex-valued vectors at Wolfram MathWorld.

Right, the name can be used in logistic regression, linear regression, and neural networks.

It is also applicable to real vectors, as indicated on the linked page.

See this for some discussion of the matrix case, and this at Wolfram MathWorld.
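For concreteness, here is a minimal sketch (my own, with made-up numbers) showing that the penalty is the same sum of squared entries whether the parameters form a vector, as in logistic regression, or matrices, as in a network; for a matrix this squared norm is usually called the Frobenius norm.

```python
import numpy as np

def l2_penalty(weights, lam, m):
    """lam/(2m) times the sum of squared entries over all weight arrays."""
    return (lam / (2 * m)) * sum(np.sum(W ** 2) for W in weights)

w = np.array([0.5, -1.0, 2.0])            # logistic-regression weight vector
W1 = np.array([[0.1, -0.2], [0.3, 0.4]])  # one layer's weight matrix
print(l2_penalty([w], lam=0.7, m=100))    # L2 norm squared of a vector
print(l2_penalty([W1], lam=0.7, m=100))   # Frobenius norm squared, same formula
```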

Well, “real vectors” are just a special case of “complex vectors”, so yes. :sunglasses:


This brings up the following thought:

  • Removal of nodes is ALSO regularization (dropping unwanted complexity), though it’s not mentioned in the course.

This would probably need to be followed by complementary training.

:thinking: Someone must have tried this.

What nodes would one drop, though? Maybe the ones whose output changes generate the highest-frequency changes in the network’s output? Throw an FFT at it.

Or one could try a population of networks with randomly dropped nodes and throw a genetic algorithm at it. So many possibilities.

You may be interested in “pruning”. For example, as this wiki article says,

The goal of this process is to maintain accuracy of the network while increasing its efficiency.

Therefore, the direction may not be what you were thinking of, namely decreasing variance. While you can google “neural network pruning” for more references, this article may also interest you.
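To make the most common criterion concrete, here is a minimal NumPy sketch (my own illustration) of magnitude-based weight pruning: zero out the entries of a weight matrix with the smallest absolute values.

```python
import numpy as np

def prune_by_magnitude(W, fraction=0.5):
    """Zero out the `fraction` of entries of W with the smallest |value|."""
    threshold = np.quantile(np.abs(W), fraction)
    return np.where(np.abs(W) < threshold, 0.0, W)

W = np.random.default_rng(1).normal(size=(4, 4))
print(prune_by_magnitude(W, fraction=0.5))  # roughly half the entries become 0
```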

Cheers!


Reducing variance by removing trained nodes does not sound like an ideal approach to me. In my opinion, we start from a small network and grow it to reduce bias, and we add regularization to control the variance. So it does not seem natural to suddenly shrink the network for variance reduction, because then we should not have grown it so large in the first place.


Maybe not. But what if you have to first climb the hill to find the solution in the valley below?

Anyway, here is the tweet from the article you pointed to, about shrinking the network:

It sounds like something to try.

I can’t just say this is impossible, because I have not investigated it. However, if the people who did a lot of pruning had found that it often improved performance, then I suppose we wouldn’t have just read this:

However, I admit that it could just be because they focused on how to reduce network size while maintaining performance, instead of how to improve performance by reducing size. For example, in that tweet, their choice of weights to prune was based on “lowest magnitudes”, whereas your choice was “highest-frequency changes in the output”. Different intentions.

While “lowest magnitudes” is quite intuitive for their intention, the question now is: is there any criterion by which we can say that a certain node is more accountable for high variance? Of this I am not sure.

Next, I am just sharing a thought process, and it may turn out not to make any sense at all. In a high-variance model, we may see a large change in the output when there is a small change in the input, so the small change gets amplified across the network. Then what kind of node can amplify it? If the input to the node contains only positive values, then a node with a large sum of weights will amplify it more, but if the input can contain any value, then it may be difficult to say. And if the output of the node is always negative, then after ReLU it becomes zero anyway. So the criterion seems like it is going to be complicated?

For which intention? For reducing network size while maintaining performance? Or for reducing variance to improve performance?


Adopting your hill-valley analogy, did you mean that the performance of B > A while allowing the performance of C > B? I was thinking of B > A and B > C.

The latter.

  • So you start with a few nodes.
  • Then you add nodes until you have something with low bias.
  • But then you find it has too much variance.
  • So you prune nodes again.
  • And maybe that gives you a good solution with appropriate variance & bias.

But none of that may make any sense, I’m just going by the feels. A rough sketch of that grow-then-prune loop follows below.
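Here is a toy sketch of the loop (my own, on synthetic data, assuming TensorFlow/Keras; the unit-scoring rule is just one guess among many, not an established recipe): grow the hidden layer until training accuracy is high, then drop the hidden units with the smallest outgoing weights and fine-tune.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
X = rng.normal(size=(1200, 20)).astype("float32")
y = (np.sin(X[:, 0]) + X[:, 1] > 0).astype("float32")
X_tr, y_tr, X_va, y_va = X[:900], y[:900], X[900:], y[900:]

def make_model(units):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),
        tf.keras.layers.Dense(units, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# 1) Grow the hidden layer until training accuracy is acceptable (low bias).
units = 4
while True:
    model = make_model(units)
    model.fit(X_tr, y_tr, epochs=40, verbose=0)
    if model.evaluate(X_tr, y_tr, verbose=0)[1] > 0.97 or units >= 256:
        break
    units *= 2

# 2) Prune: drop the half of the hidden units with the smallest
#    outgoing-weight magnitude, then fine-tune ("complementary training").
W1, b1 = model.layers[0].get_weights()
W2, b2 = model.layers[1].get_weights()
scores = np.abs(W2).sum(axis=1)                        # one score per hidden unit
keep = np.sort(np.argsort(scores)[len(scores) // 2:])  # keep the top half
pruned = make_model(len(keep))
pruned.layers[0].set_weights([W1[:, keep], b1[keep]])
pruned.layers[1].set_weights([W2[keep, :], b2])
pruned.fit(X_tr, y_tr, epochs=10, verbose=0)
print("validation accuracy after pruning:",
      pruned.evaluate(X_va, y_va, verbose=0)[1])
```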

Here is another thing that one can try for regularization:

  • All of this stuff is based on floating-point numbers of up to 64 bits.
  • What happens if one reduces the allowed range and switches to 16-bit floats, say? (This is also advantageous for avoiding excessively complex hardware and excessive energy expenditure.) What kind of range do we really need?

Maybe the network starts to generate really bad answers, but that in itself would be a sign that one is relying on tiny numerical noise in the computation.
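Here is a toy NumPy illustration of the idea (my own sketch): the same ReLU layer evaluated in float64 and in float16, to see how much the reduced precision perturbs the result.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(100, 100))
x = rng.normal(size=100)

y64 = np.maximum(W @ x, 0)  # ReLU layer in float64
y16 = np.maximum(W.astype(np.float16) @ x.astype(np.float16), 0)

print("max abs difference:", np.max(np.abs(y64 - y16.astype(np.float64))))
```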

That’s usually how things begin … :raised_hands:

Besides pruning, another technique for shrinking the size of a trained network is called “quantization”, which does exactly the kind of thing you mentioned: going from 64-bit to, for example, 16-bit, which gives a 75% cut. TensorFlow has this page covering both “Post-training quantization” and “Quantization-aware training”. However, again, network shrinking might not be your focus.
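For example, here is a minimal sketch of post-training float16 quantization with the TensorFlow Lite converter (API usage as I recall it from that page; the toy model is my own, so do check the docs):

```python
import tensorflow as tf

# A toy model standing in for your trained network.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]  # store weights as float16
tflite_model = converter.convert()

with open("model_fp16.tflite", "wb") as f:
    f.write(tflite_model)
```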

If we think that too much precision causes high variance, then besides using 16-bit floats, we also have this “GaussianNoise” layer, which really is considered a regularization layer in TensorFlow, and we can use it at training time to perturb the input to a layer.
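A minimal usage sketch (the surrounding toy model is mine; the layer itself is the Keras one, and it only adds noise during training, not at inference):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.GaussianNoise(stddev=0.1),  # perturbs inputs at training time only
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```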

I think your ideas make a lot of sense!

Cheers,
Raymond


Thank you for those further pointers. As a final input to the discussion, I found this review paper:

I also updated the mindmap somewhat, adding better text.
