In the video on weight initialization, Prof Ng asks us to set the variance of Wi to 1/n but doesn't really give a reason for it. I also went through the arXiv paper on Kaiming He initialization, but I still can't wrap my head around why it's 1/n.
Could someone help me out here and explain in layman terms?
@vxidh From an intuitive standpoint, the big issue is that we don't want our gradients to start off too small, in which case they might vanish entirely as we start to optimize, but also not too large, such that they explode to an enormous size and we basically overflow.
Thus it is a matter of striking a careful balance. Setting the variance to 1/n (with n being the number of inputs to the present node, i.e. the size of the previous layer), or 2/n in the case of ReLU, gives us a reasonable scale: each node sums n weighted inputs, so the variance of its output grows with n, and dividing the weight variance by n keeps the signal at roughly the same size from layer to layer.
It is a little like ‘Goldilocks and the Three Bears’
** Note, also, that though the weights are still slightly random, scaling by n ensures the initial weights of each layer remain proportional/balanced to one another, at least at the beginning.
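If it helps to see it in code, here is a rough NumPy sketch of what that scaling looks like in practice. The function name, layer sizes, and the `activation` switch are just illustrative, not taken from the course assignments:

```python
import numpy as np

def initialize_parameters(layer_dims, activation="relu"):
    """Initialize weights with Var(W) = 1/n (tanh/Xavier-style) or 2/n (ReLU/He-style),
    where n is the number of inputs feeding each layer (the size of the previous layer)."""
    params = {}
    for l in range(1, len(layer_dims)):
        n_in = layer_dims[l - 1]                      # fan-in of layer l
        scale = 2.0 / n_in if activation == "relu" else 1.0 / n_in
        # randn gives unit variance, so multiplying by sqrt(scale) gives variance = scale
        params["W" + str(l)] = np.random.randn(layer_dims[l], n_in) * np.sqrt(scale)
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

# Example: a small net with 5 input features and ReLU hidden layers
params = initialize_parameters([5, 4, 3, 1], activation="relu")
print(params["W1"].std())   # roughly sqrt(2/5) ≈ 0.63
```

The only thing that changes between the 1/n and 2/n variants is the constant in front; the point is that the spread of the initial weights shrinks as the number of inputs to a node grows.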