Is there a reason why the variance of the weights of a particular layer should be 2/n?

hazingo · December 20, 2022, 3:13am

I don’t understand how this is derived mathematically. Could anyone please explain it, similar to how the normalization formula is derived for tanh? Additionally, is there a variance normalization formula for sigmoid?

paulinpaloalto · December 20, 2022, 4:30am

This is all explained in the lectures. The point is not that the variance is 2/n a priori. The point is that the weights are deliberately initialized to have that variance as a way to get better convergence behavior. Also note that Prof Ng is not saying that it always works, either. There are several different algorithms, e.g. He Initialization and Xavier Initialization. There is no “silver bullet” solution that works best in all cases. Please watch the relevant lecture again with the above thoughts in mind and hopefully it will make sense the second time through.
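For anyone who wants to see the effect numerically, here is a rough sketch (not taken from the course materials). The usual argument is that for z = sum of w_i * a_i over n inputs, with independent zero-mean weights, Var(z) = n * Var(w) * E[a^2]. With ReLU, about half of the signal is zeroed out, so E[a^2] is roughly Var(z_prev) / 2, and choosing Var(w) = 2/n keeps the scale roughly constant from layer to layer. With tanh, which is close to linear near zero, the same reasoning gives 1/n. The function names, layer width, depth, and batch size below are all arbitrary choices made for illustration.

```python
import numpy as np

def init_weights(n_in, n_out, scale, rng):
    # Gaussian weights with variance scale / n_in
    # (scale=2 corresponds to He-style, scale=1 to the 1/n rule).
    return rng.standard_normal((n_out, n_in)) * np.sqrt(scale / n_in)

def forward_mean_square(scale, n=512, depth=20, batch=256, seed=0):
    # Push random inputs through a stack of ReLU layers (no biases)
    # and return the mean squared activation of the last layer.
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, batch))
    for _ in range(depth):
        W = init_weights(n, n, scale, rng)
        a = np.maximum(0.0, W @ a)  # ReLU
    return float(np.mean(a ** 2))

ms_he = forward_mean_square(2.0)   # Var(W) = 2/n
ms_one = forward_mean_square(1.0)  # Var(W) = 1/n

print(f"Var(W)=2/n: mean square after 20 ReLU layers = {ms_he:.3g}")
print(f"Var(W)=1/n: mean square after 20 ReLU layers = {ms_one:.3g}")
```

Under these assumptions, the 2/n run should stay on the order of 1, while the 1/n run should shrink by roughly a factor of 2 per layer. That only illustrates the forward-pass scaling; it does not show that one initialization always trains better.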
