In the video Weight Initialization for Deep Networks, Andrew introduces the Kaiming He initialization. At some point he mentions that the w_i's need to be small and that it would be a good idea to make sure that Var(w_i) = \frac{1}{n}. However, he never seems to explain exactly why.
I've been looking at the paper (https://arxiv.org/pdf/1502.01852v1.pdf) in which it was introduced, and I would like to check whether I understand the idea correctly: if each of the w_i's has a variance of \frac{1}{n}, then the sum z = \sum_{i=1}^{n} w_i x_i has variance n \cdot \frac{1}{n} = 1, assuming the inputs x_i also have unit variance. So the variance of z would be 1. And if you make sure that every layer keeps this same variance of 1, you avoid exponential growth or decay of the activations across layers, which would otherwise make the gradients explode or vanish. Is this intuition correct?
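To convince myself, I put together a small numpy sketch (my own, not from the course or the paper) that pushes unit-variance inputs through a stack of purely linear layers and compares a few choices for Var(w_i). The layer width n, the depth, and the variance values are just made-up numbers for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 512      # hypothetical units per layer
depth = 50   # hypothetical number of layers
x = rng.standard_normal(n)   # inputs with variance ~1

for var_w in [1.0 / n, 2.0 / n, 0.5 / n]:
    a = x.copy()
    for _ in range(depth):
        # each weight is drawn with Var(w_i) = var_w
        W = rng.standard_normal((n, n)) * np.sqrt(var_w)
        a = W @ a   # purely linear layers, no activation function
    print(f"Var(w_i) = {var_w:.5f} -> Var(output) after {depth} layers: {a.var():.3e}")
```

If I run this, the Var(w_i) = \frac{1}{n} case keeps the output variance around 1, while the slightly larger and smaller values blow up or shrink roughly exponentially with depth, which looks like exactly the explosion/vanishing behaviour I described above.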