Dear Deeplearning.ai team,
I’m writing regarding the video ‘Weight Initialization for Deep Networks’, roughly 2:22-2:58 (the variance correction when using ReLU as the activation function).
In that part of the video, it is said:
“It turns out that if you’re using a ReLu activation function that, rather than 1 over n it turns out that, set in the variance of 2 over n works a little bit better.”
Let’s consider x, a random variable drawn from a normal distribution with mean = 0 and variance = 1.
Then the random variable y = max{0, x} has mean = 1 / sqrt(2*pi) and second moment E[y^2] = 1/2 (so its variance is 1/2 - 1/(2*pi)). Links with the computations: mean, variance
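For reference, these numbers can be checked with a quick Monte Carlo simulation along the following lines (just a NumPy sketch; the sample size and seed are arbitrary choices):

```python
# Monte Carlo check of the moments of y = max{0, x} for x ~ Normal(0, 1).
# (A quick NumPy sketch; sample size and seed are arbitrary.)
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000_000)   # x ~ Normal(0, 1)
y = np.maximum(0.0, x)                # y = max{0, x}

print("mean of y:           ", y.mean())        # ~ 1/sqrt(2*pi) ≈ 0.3989
print("second moment E[y^2]:", np.mean(y**2))   # ~ 1/2
print("variance of y:       ", y.var())         # ~ 1/2 - 1/(2*pi) ≈ 0.3408
```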
If we instead draw x from a normal distribution with mean = 0 and variance = 1/n,
then y = max{0, x} has mean = 1 / (sqrt(2*pi) * sqrt(n)) and second moment E[y^2] = 1 / (2n). Links with the computations: mean, variance
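The same kind of check for the scaled case, with n = 100 picked arbitrarily just for illustration:

```python
# Monte Carlo check for x ~ Normal(0, 1/n); here n = 100 is an arbitrary example.
import numpy as np

n = 100
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0 / np.sqrt(n), size=10_000_000)  # variance = 1/n
y = np.maximum(0.0, x)

print("mean of y:           ", y.mean())        # ~ 1/(sqrt(2*pi)*sqrt(n)) ≈ 0.0399
print("second moment E[y^2]:", np.mean(y**2))   # ~ 1/(2n) = 0.005
```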
So it seems that we end up with 1/(2n) rather than the 2/n from the video, doesn’t it?
I would be very grateful for your comments.
Thanks a lot!