All of these have been seen in the course except “Standard Dropout”, which is related to “Inverted Dropout”, and I had the idea of just removing input features at random.
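If it helps to make that concrete, here is a minimal sketch of the “remove input features at random” idea; it is just a Keras Dropout layer applied directly to the inputs (the layer sizes are made up for illustration):

```python
import tensorflow as tf

# Minimal sketch: "removing input features at random" is simply dropout
# applied at the input layer. The sizes (20 features, 16 hidden units)
# are made up for illustration.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dropout(0.2),                  # zeroes 20% of input features (training only)
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```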
I thought the name “L2 Regularization” was only used for the weight vector in Logistic Regression, but it’s also used in the general case of weight matrices, as in the subsequent programming assignment.
Is this quite correct?
In fact, Wolfram MathWorld defines the “L2 norm” as the norm of complex-valued vectors.
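For concreteness, here is a small NumPy sketch of that general case: the penalty is the sum of squared entries of every weight matrix (technically the squared Frobenius norm of each matrix), scaled by lambda/(2m). The matrix shapes below are made up.

```python
import numpy as np

def l2_regularization_cost(weight_matrices, lambd, m):
    """Sum of squared entries of every weight matrix, scaled by lambda / (2 * m).

    This is the general-case "L2" penalty for weight matrices
    (each term is the squared Frobenius norm of that matrix).
    """
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weight_matrices)

# Made-up shapes, purely for illustration.
W1 = np.random.randn(4, 3)
W2 = np.random.randn(1, 4)
print(l2_regularization_cost([W1, W2], lambd=0.7, m=100))
```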
Removal of nodes is ALSO regularization (dropping unwanted complexity), though it’s not mentioned in the course.
This would probably need to be followed by complementary training.
Someone must have tried this.
What nodes would one drop though? Maybe the ones that generate the highest-frequency changes in the output when their output changes? Throw an FFT at it.
Or one could try with a population of networks with randomly dropped nodes and throw a genetic algorithm at it. So many possibilities.
You may be interested in “pruning”. For example, as this wiki article says,
The goal of this process is to maintain accuracy of the network while increasing its efficiency.
Therefore, the direction may not be what you were thinking, i.e. to decrease variance. While you can google “neural network pruning” for more references, this article may be interesting to you, too.
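As a rough illustration of what magnitude-based pruning does (my own sketch, not the method of any particular paper), one can zero out the smallest-magnitude weights of a trained matrix:

```python
import numpy as np

def prune_by_magnitude(W, sparsity=0.5):
    """Zero out the fraction `sparsity` of entries of W with the smallest |value|.

    A rough illustration of magnitude-based pruning, not a reference
    implementation of any particular paper.
    """
    threshold = np.quantile(np.abs(W), sparsity)
    mask = np.abs(W) >= threshold
    return W * mask

# Made-up weight matrix, purely for illustration.
W = np.random.randn(5, 5)
print(prune_by_magnitude(W, sparsity=0.5))
```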
Reducing variance by removing trained nodes does not sound like an ideal approach to me. In my opinion, we start from a small network and grow it to reduce bias, and we add regularization to control the variance. So it doesn’t seem natural to suddenly shrink the network for variance reduction, because then we shouldn’t have grown it so large in the first place.
I can’t say this is impossible, because I have not investigated it. However, if people who do a lot of pruning had found that it often improves performance, then I suppose we wouldn’t have just read this:
However, I admit that it could just be because they focused on how to reduce network size while maintaining performance, rather than how to improve performance by reducing size. For example, in that tweet, their choice of weights to prune was based on “lowest magnitudes”, but your choice was “highest-frequency changes in the output”. Different intentions.
While “lowest magnitudes” is quite intuitive for their intention, the problem now is: is there any criterion by which we can say that a certain node is more likely responsible for high variance? I am not sure about this.
Next, I am just sharing some thought process, and it may turn out not to make any sense at all… In a high-variance model, we may see a large change in output when there is a small change in input, so the small change gets amplified across the network. Then what kind of nodes can amplify it? If the input to a node contains only positive values, then a node with a large sum of weights will amplify it more; but if the input can contain any values, then it may be difficult to say. And if the output of a node is always negative, then after ReLU it becomes zero anyway. So the criterion seems to get complicated?
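To make the “small change in input, large change in output” idea concrete, here is a sketch (my own, not from the course) that probes a model’s input-output sensitivity with a gradient; the model and inputs are made up:

```python
import tensorflow as tf

def input_sensitivity(model, x):
    """Norm of d(output)/d(input) per example: a crude probe of how much a
    small input perturbation gets amplified by the network. A sketch only,
    not a proper variance estimate."""
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)
        y = model(x)
    grad = tape.gradient(y, x)          # shape: (batch, n_features)
    return tf.norm(grad, axis=-1)       # one sensitivity value per example

# Made-up model and inputs, purely for illustration.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])
x = tf.random.normal((3, 4))
print(input_sensitivity(model, x))
```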
For which intention? For reducing network size while maintaining performance? Or for reducing variance to improve performance?
Adopting your hill-valley analogy, did you mean that the performance of B > A while allowing the performance of C > B? I was thinking about B > A and B > C.
Then you add nodes until you have something that has low bias.
But then you find it has too much variance.
So you prune nodes again.
And maybe that gives you a good solution with appropriate variance & bias.
But none of that may make any sense, I’m just going by the feels.
Here is another thing that one can try for regularization:
All of this stuff is based on floating-point numbers of up to 64 bits.
What happens if one reduces the allowed range, switching to 16-bit floats maybe (this is also advantageous for avoiding excessively complex hardware and excessive energy expenditure)? What kind of range do we really need?
Maybe the network starts to generate really bad answers, but that in itself would be a sign that one is relying on tiny numerical details (noise) in the computation.
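For what it’s worth, trying 16-bit floats is easy in TensorFlow with the mixed-precision API. This is just a sketch of how to switch it on; whether reduced precision actually acts as a regularizer is exactly the open question here, and the layer sizes are made up:

```python
import tensorflow as tf

# Run most computations in float16 while keeping the variables in float32.
# Whether reduced precision helps as regularization is an open question;
# this only shows how to try it.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(16, activation="relu"),
    # Keep the final output in float32 for numerical stability.
    tf.keras.layers.Dense(1, activation="sigmoid", dtype="float32"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```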
If we think that too much precision causes high variance, then besides using 16-bit floats, we also have the “GaussianNoise” layer, which is treated as a regularization layer in TensorFlow, and we can use it at training time to perturb the input to a layer.
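For example (layer sizes made up), GaussianNoise only perturbs activations during training and passes values through unchanged at inference:

```python
import tensorflow as tf

# GaussianNoise adds zero-mean Gaussian noise to its input at training time
# only; at inference it is a pass-through. Sizes are made up for illustration.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.GaussianNoise(stddev=0.1),     # perturb the network inputs
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.GaussianNoise(stddev=0.1),     # perturb a hidden layer's input
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```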