Weight Initialization for Deep Networks: why aim for Var(W_i) = 1/n

In the video Weight Initialization for Deep Networks, Andrew introduces He (Kaiming) initialization. At some point he mentions that the w_i need to be small and that it is a good idea to make sure that Var(w_i)=\frac{1}{n}. However, he never really explains why.

I’ve been looking at the paper in which it is introduced (https://arxiv.org/pdf/1502.01852v1.pdf) and I would like to check whether I have the idea right: if each of the w_i's has variance \frac{1}{n}, then the n terms w_i x_i that are summed to form z contribute a total variance of n\cdot\frac{1}{n} = 1 (assuming the inputs x_i have unit variance). So the variance of z would be 1. And if every layer preserves this variance of 1, you avoid exponential growth or decay of the activations as the signal passes through the layers, which is what would make the gradients explode or vanish. Is this intuition correct?
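Written out, the step I am assuming here (independent, zero-mean w_i and x_i, with Var(x_i) = 1) is:

\mathrm{Var}(z) = \mathrm{Var}\Big(\sum_{i=1}^{n} w_i x_i\Big) = \sum_{i=1}^{n} \mathrm{Var}(w_i)\,\mathrm{Var}(x_i) = n \cdot \frac{1}{n} \cdot 1 = 1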


Yes, I think that’s the correct interpretation. They explain all this in section 2.2 of the paper. The math is a trifle more complex than n * 1/n, but the intent is what you expressed: to keep the variance bounded in a reasonable range as you compound the layers. By the Chain Rule, of course, you end up multiplying the gradients across the layers.
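If it helps to see the compounding effect numerically, here is a minimal NumPy sketch (my own illustration, not code from the course or the paper) that pushes unit-variance inputs through 50 plain linear layers. With Var(W_ij) = 1/n the activation variance stays near 1, while scaling the weight standard deviation up or down by just 50% makes it explode or vanish after 50 layers. The width n, depth, and scale factors are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 50
x = rng.standard_normal((n, 1000))   # 1000 input vectors with unit variance

def variance_after(depth, weight_std):
    """Push x through `depth` linear layers whose weights have the given
    standard deviation, and return the variance of the final activations."""
    a = x
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * weight_std   # Var(W_ij) = weight_std**2
        a = W @ a
    return a.var()

# Var(W_ij) = 1/n keeps the variance of z roughly constant layer after layer.
print(variance_after(depth, np.sqrt(1.0 / n)))         # stays near 1
# Slightly bigger or smaller weights compound exponentially over 50 layers.
print(variance_after(depth, 1.5 * np.sqrt(1.0 / n)))   # explodes (~2.25**50)
print(variance_after(depth, 0.5 * np.sqrt(1.0 / n)))   # vanishes (~0.25**50)
```

The same multiplicative compounding happens to the gradients on the backward pass, which is why keeping the per-layer variance near 1 helps on both sides.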
