In order to avoid the issue of vanishing or exploding gradients, Prof. Andrew suggested that w_i should have a variance of 1/n. But why do we multiply by the standard deviation while initializing using np.random.randn()? i.e., `np.random.randn(shape) * np.sqrt(1/n[l-1])`, where n[l-1] is the number of units in layer l-1.

Please identify which course you are attending. Use the “pencil” tool in the thread title to move your message to the appropriate forum area.

Hi @Rhythm_Dutta, this is an interesting question!

By initializing w with a standard deviation of `np.sqrt(1/n[l-1])`, i.e., inversely proportional to the square root of the number of input units in the previous layer, the goal is to keep the variance of the signal approximately the same as it passes from one layer to the next. This helps prevent vanishing/exploding gradients.
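
As a quick illustration, here is a minimal sketch, assuming a purely linear (no-activation) network; the width of 500 units and depth of 20 layers are arbitrary values chosen just for this demo:

```python
import numpy as np

np.random.seed(0)
n, L = 500, 20             # layer width and depth: arbitrary demo values
x = np.random.randn(n, 1)  # input signal with roughly unit variance

a_naive, a_scaled = x, x
for _ in range(L):
    # Unscaled weights: the signal's scale grows by ~sqrt(n) each layer
    a_naive = np.random.randn(n, n) @ a_naive
    # Weights scaled to variance 1/n: the signal's scale stays roughly constant
    a_scaled = (np.random.randn(n, n) * np.sqrt(1 / n)) @ a_scaled

print(a_naive.std())   # astronomically large: the signal has exploded
print(a_scaled.std())  # still close to 1: the scale is preserved
```

With unscaled weights the signal explodes after only a few layers, while the scaled initialization keeps it near its original magnitude, which is exactly the stability this scheme is designed for.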

`np.random.randn(shape)`

returns an array whose elements are drawn from a standard normal distribution, i.e., with mean 0 and variance 1 (not, as is sometimes assumed, values between 0 and 1).
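
You can check this empirically (the sample size here is arbitrary):

```python
import numpy as np

z = np.random.randn(1_000_000)  # one million standard-normal draws
print(z.mean())                 # close to 0
print(z.std())                  # close to 1
print((np.abs(z) > 1).mean())   # ~32% of draws fall outside [-1, 1]
```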

We multiply

`np.random.randn(shape)`

by the standard deviation term to scale the randomly initialized w to the desired variance: if z is drawn from a standard normal distribution, then c * z has variance c^2, so multiplying by `np.sqrt(1/n[l-1])` gives w a variance of exactly 1/n[l-1]. Doing this, we aim to achieve stable training.
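
As a concrete example, here is a minimal sketch; the layer sizes 300 and 500 are arbitrary assumptions for the demo:

```python
import numpy as np

n_prev = 500  # units in the previous layer (assumed)
n_curr = 300  # units in the current layer (assumed)

# Scale standard-normal draws by sqrt(1/n_prev) to get variance 1/n_prev
W = np.random.randn(n_curr, n_prev) * np.sqrt(1 / n_prev)

print(W.std())              # ~0.0447, matching the target below
print(np.sqrt(1 / n_prev))  # target standard deviation: sqrt(1/500)
```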