# C2W1 Weight Initialization for Deep Networks

Dear Deeplearning.ai team,

I’m writing regarding the video ‘Weight Initialization for Deep Networks’, the part from 2:22 to 2:58 (the correction of the variance when ReLU is used as the activation function).

In this part of the video, it is said that:

“It turns out that if you’re using a ReLu activation function that, rather than 1 over n it turns out that, set in the variance of 2 over n works a little bit better.”

Let’s consider x, a random variable drawn from a normal distribution with mean = 0 and variance = 1.

Then the random variable y = max{0, x} is going to have mean = 1 / sqrt(2*pi) and variance = 1/2. Links with the computations: mean, variance

If we change the distribution of x to a normal with mean = 0 and variance = 1/n:

Then y = max{0, x} will have mean = 1 / (sqrt(2*pi) * sqrt(n)) and variance = 1 / (2n). Links with the computations: mean, variance

So it seems like we get 1/(2n) rather than the 2/n from the video, doesn’t it?
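(As a sanity check, here is a small Monte Carlo sketch in NumPy; the fan-in n = 100 and the sample count are arbitrary choices of mine, and the 1/(2n) figure is checked as the second moment E[y^2] of the rectified variable.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                 # arbitrary "fan-in", just for illustration
samples = 5_000_000

# x ~ Normal(mean=0, variance=1/n), y = max(0, x)
x = rng.normal(loc=0.0, scale=np.sqrt(1.0 / n), size=samples)
y = np.maximum(0.0, x)

print("mean(y) =", y.mean(),
      "   expected 1/(sqrt(2*pi)*sqrt(n)) =", 1.0 / (np.sqrt(2.0 * np.pi) * np.sqrt(n)))
print("E[y^2]  =", np.mean(y**2),
      "   expected 1/(2n) =", 1.0 / (2.0 * n))
```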

Thanks a lot!

Hi, @g.dychko.

Great question. Feel free to correct me if I got it wrong.

You are not far off. In the lecture, 2/n is the variance of w[l], not of y[l]. From the He initialization paper, the expression for the latter is Var(y[l]) = (1/2) * n[l] * Var(w[l]) * Var(y[l-1]) (you can see the derivation on page 4). For the whole network you have Var(y[L]) = Var(y[1]) * prod_{l=2..L} (1/2) * n[l] * Var(w[l]). To prevent the product from becoming exponentially large or small, they set (1/2) * n[l] * Var(w[l]) = 1 for every layer. Solving for Var(w[l]) you end up with 2/n.
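If it helps, here is a minimal NumPy sketch of that argument (the width, depth, and batch size are arbitrary choices of mine): with Var(w[l]) = 1/n the per-layer factor is about 1/2, so the activations shrink exponentially with depth, while with Var(w[l]) = 2/n their scale stays roughly constant.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu_stack_variance(n_units, w_variance, depth=20, batch=2000):
    """Forward-propagate a random batch through `depth` ReLU layers with
    weights drawn from Normal(0, w_variance); return the variance of the
    final activations."""
    a = rng.normal(size=(batch, n_units))
    for _ in range(depth):
        w = rng.normal(scale=np.sqrt(w_variance), size=(n_units, n_units))
        a = np.maximum(0.0, a @ w)
    return a.var()

n = 512
print("Var(w)=1/n :", relu_stack_variance(n, 1.0 / n))  # shrinks by ~1/2 per layer
print("Var(w)=2/n :", relu_stack_variance(n, 2.0 / n))  # stays roughly constant
```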


Dear @nramon,