Weight Initialization for Deep Networks: why aim for Var(W_i) = 1/n

In the video Weight Initialization for Deep Networks, Andrew introduces He (Kaiming) initialization. At some point he mentions that the w_i need to be small and that it is a good idea to make sure that Var(w_i)=\frac{1}{n}. However, he never really explains why.

I’ve been looking at the paper in which it is introduced (https://arxiv.org/pdf/1502.01852v1.pdf) and I would like to check whether I have the idea right: if each of the w_i's has variance \frac{1}{n}, then the n terms w_i x_i that are summed to form z contribute a total variance of n\cdot\frac{1}{n} = 1 (assuming the inputs x_i have unit variance). So the variance of z would be 1. And if every layer preserves this variance of 1, you avoid exponential growth or decay of the activations as the signal passes through the layers, which is what would make the gradients explode or vanish. Is this intuition correct?
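Written out, the step I am assuming here (independent, zero-mean w_i and x_i, with Var(x_i) = 1) is:

\mathrm{Var}(z) = \mathrm{Var}\Big(\sum_{i=1}^{n} w_i x_i\Big) = \sum_{i=1}^{n} \mathrm{Var}(w_i)\,\mathrm{Var}(x_i) = n \cdot \frac{1}{n} \cdot 1 = 1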


Yes, I think that’s the correct interpretation. They explain all this in section 2.2 of the paper. The math is a trifle more complex than n * 1/n, but the intent is what you expressed: to keep the variance bounded in a reasonable range as you compound the layers. By the Chain Rule, of course, you end up multiplying the gradients across the layers.
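If it helps to see the compounding effect numerically, here is a minimal NumPy sketch (my own illustration, not code from the course or the paper) that pushes unit-variance inputs through 50 plain linear layers. With Var(W_ij) = 1/n the activation variance stays near 1, while scaling the weight standard deviation up or down by just 50% makes it explode or vanish after 50 layers. The width n, depth, and scale factors are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 50
x = rng.standard_normal((n, 1000))   # 1000 input vectors with unit variance

def variance_after(depth, weight_std):
    """Push x through `depth` linear layers whose weights have the given
    standard deviation, and return the variance of the final activations."""
    a = x
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * weight_std   # Var(W_ij) = weight_std**2
        a = W @ a
    return a.var()

# Var(W_ij) = 1/n keeps the variance of z roughly constant layer after layer.
print(variance_after(depth, np.sqrt(1.0 / n)))         # stays near 1
# Slightly bigger or smaller weights compound exponentially over 50 layers.
print(variance_after(depth, 1.5 * np.sqrt(1.0 / n)))   # explodes (~2.25**50)
print(variance_after(depth, 0.5 * np.sqrt(1.0 / n)))   # vanishes (~0.25**50)
```

The same multiplicative compounding happens to the gradients on the backward pass, which is why keeping the per-layer variance near 1 helps on both sides.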
