Doubt on He initialization

In order to avoid the issue of vanishing or exploding gradients, Prof. Andrew suggested that w_i should have a variance of 1/n. But why do we multiply by the standard deviation when initializing with np.random.randn()? i.e., np.random.randn(shape) * np.sqrt(1/n^[l-1]), where n^[l-1] is the number of units in layer l-1.

Hi @Rhythm_Dutta, this is an interesting question.

By initializing w with a standard deviation that is inversely proportional to the square root of n^[l-1], the number of input units in the previous layer (the np.sqrt(1/n^[l-1]) term), the goal is to keep the variance of the signals flowing into and out of each layer approximately the same across all l layers. This helps prevent vanishing/exploding gradients.
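Here is a minimal sketch of that scheme (the helper name `initialize_weights`, the layer sizes, and the use of `np.random.default_rng` are my own illustrative choices, not from the course notebooks):

```python
import numpy as np

def initialize_weights(layer_dims, seed=0):
    """Return W and b for each layer, with Var(W) = 1 / n^[l-1]."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev = layer_dims[l - 1]  # n^[l-1]: units in the previous layer
        # standard-normal draws rescaled to std = sqrt(1 / n_prev);
        # He initialization for ReLU layers would use np.sqrt(2.0 / n_prev) instead
        params["W" + str(l)] = rng.standard_normal((layer_dims[l], n_prev)) * np.sqrt(1.0 / n_prev)
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

params = initialize_weights([784, 128, 10])
print(params["W1"].std())  # close to sqrt(1/784) ~= 0.0357
```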

np.random.randn(shape) returns an array of samples drawn from the standard normal distribution, i.e., with mean 0 and standard deviation 1. (The elements do not vary from 0 to 1; that describes np.random.rand, which samples uniformly from [0, 1).)
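You can check this empirically (a quick sketch; the sample size is arbitrary):

```python
import numpy as np

z = np.random.randn(1_000_000)  # standard normal samples
print(z.mean(), z.std())        # roughly 0.0 and 1.0
print(z.min(), z.max())         # values well outside [0, 1]

u = np.random.rand(1_000_000)   # uniform on [0, 1) -- the one actually bounded by 0 and 1
print(u.min(), u.max())
```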

We multiply the output of np.random.randn(shape) by the standard-deviation term to rescale the randomly initialized w to the desired variance: if z is drawn from N(0, 1), then c * z has variance c^2, so multiplying by np.sqrt(1/n^[l-1]) gives the weights a variance of 1/n^[l-1]. Doing this, we seek to achieve stable training.
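For example (the layer sizes 256 and 400 here are made up for illustration):

```python
import numpy as np

n_prev = 400  # hypothetical n^[l-1]
W = np.random.randn(256, n_prev) * np.sqrt(1.0 / n_prev)
print(W.std())  # close to sqrt(1/400) = 0.05
print(W.var())  # close to 1/400 = 0.0025
```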