In the video on weight initialization, Prof Ng asks us to set the variance of Wi to 1/n but doesn't really give a reason for it. I also went through the arXiv paper on Kaiming He initialization, but I still can't wrap my head around why it's 1/n.
Could someone help me out here and explain in layman terms?
@vxidh From an intuitive standpoint, the big issue is that we don't want our gradients to start off too small, in which case they might vanish entirely as we start to optimize, but also not too large, such that they explode to an enormous size and we basically overflow.
Thus it is a matter of striking a careful balance. Setting the variance to 1/n (with n being the number of inputs to the present node, i.e. the size of the previous layer), or 2/n in the case of ReLU, gives us a reasonable scale: each node sums n weighted inputs, so the variance of its output grows with n, and dividing the weight variance by n keeps the signal at roughly the same size from layer to layer.
It is a little like ‘Goldilocks and the Three Bears’
** Note, also, that though the weights are still slightly random, scaling by n ensures the initial weights of each layer remain proportional/balanced to one another, at least at the beginning.
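If it helps to see it in code, here is a rough NumPy sketch of what that scaling looks like in practice. The function name, layer sizes, and the `activation` switch are just illustrative, not taken from the course assignments:

```python
import numpy as np

def initialize_parameters(layer_dims, activation="relu"):
    """Initialize weights with Var(W) = 1/n (tanh/Xavier-style) or 2/n (ReLU/He-style),
    where n is the number of inputs feeding each layer (the size of the previous layer)."""
    params = {}
    for l in range(1, len(layer_dims)):
        n_in = layer_dims[l - 1]                      # fan-in of layer l
        scale = 2.0 / n_in if activation == "relu" else 1.0 / n_in
        # randn gives unit variance, so multiplying by sqrt(scale) gives variance = scale
        params["W" + str(l)] = np.random.randn(layer_dims[l], n_in) * np.sqrt(scale)
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

# Example: a small net with 5 input features and ReLU hidden layers
params = initialize_parameters([5, 4, 3, 1], activation="relu")
print(params["W1"].std())   # roughly sqrt(2/5) ≈ 0.63
```

The only thing that changes between the 1/n and 2/n variants is the constant in front; the point is that the spread of the initial weights shrinks as the number of inputs to a node grows.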