C2W1 Weight Initialization for Deep Networks

Dear Deeplearning.ai team,

I’m writing regarding the video ‘Weight Initialization for Deep Networks’, the part from 2.22 to 2.58 (the variance correction when ReLU is used as the activation function).

In this part of the video, it is said:

“It turns out that if you’re using a ReLU activation function, then rather than 1 over n, setting the variance to 2 over n works a little bit better.”
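For reference, here is what that recommendation looks like in code. This is only a minimal NumPy sketch, assuming a fully connected network; the function name and layer sizes are made up for illustration and are not taken from the course:

```python
import numpy as np

def initialize_he(layer_dims, seed=0):
    """He initialization: W[l] ~ N(0, 2 / n[l-1]), b[l] = 0.

    `layer_dims` is a hypothetical list of layer sizes, e.g. [n_x, n_h1, ..., n_y].
    """
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        fan_in = layer_dims[l - 1]
        # Scale a standard normal by sqrt(2 / fan_in) so that Var(W[l]) = 2 / n[l-1].
        params[f"W{l}"] = rng.standard_normal((layer_dims[l], fan_in)) * np.sqrt(2.0 / fan_in)
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params

params = initialize_he([500, 300, 100, 1])
print(params["W1"].std())  # close to sqrt(2 / 500) ≈ 0.063
```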

Let’s consider x, a random variable drawn from a Normal distribution with mean = 0 and variance = 1.

Then the random variable y = max{0, x} has mean = 1 / sqrt(2*pi) and second moment E[y^2] = 1/2, so its variance is 1/2 - 1/(2*pi).

Now change the distribution of x to Normal with mean = 0 and variance = 1/n.

Then y = max{0, x} has mean = 1 / (sqrt(2*pi) * sqrt(n)) and second moment E[y^2] = 1 / (2*n), so its variance is (1/2 - 1/(2*pi)) / n.
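These values can be checked numerically. A quick Monte Carlo sketch in NumPy (the sample size and the value of n are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = 1_000_000
n = 10  # arbitrary fan-in for the variance = 1/n case

# x ~ N(0, 1): mean of y = max(0, x) should be 1/sqrt(2*pi) ≈ 0.399,
# E[y^2] should be 1/2, and Var(y) = 1/2 - 1/(2*pi) ≈ 0.341.
x = rng.standard_normal(samples)
y = np.maximum(0.0, x)
print(y.mean(), (y ** 2).mean(), y.var())

# x ~ N(0, 1/n): the mean scales by 1/sqrt(n), the second moment and the
# variance scale by 1/n, so rescaling should recover the same three numbers.
x_n = rng.normal(0.0, 1.0 / np.sqrt(n), samples)
y_n = np.maximum(0.0, x_n)
print(y_n.mean() * np.sqrt(n), (y_n ** 2).mean() * n, y_n.var() * n)
```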

So it seems that we end up with roughly 1/(2*n), not 2/n as in the video, doesn’t it?

I would be very grateful for your comments.
Thanks a lot!

Hi, @g.dychko.

Great question. Feel free to correct me if I got it wrong.

You are not far off. In the lecture, 2/n is the variance of w[l], not of y[l]. From the He initialization paper, the variance of the latter satisfies Var(y[l]) = (1/2) * n[l] * Var(w[l]) * Var(y[l-1]) (you can see the derivation on page 4). For the whole network you have Var(y[L]) = Var(y[1]) * prod over l of [ (1/2) * n[l] * Var(w[l]) ]. To prevent that product from becoming exponentially large or small they set (1/2) * n[l] * Var(w[l]) = 1 for every layer. Solving for Var(w[l]) you end up with 2/n.
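To see that condition in action, here is a small numerical sketch (the layer width, depth, and batch size are arbitrary, and relu_forward_variance is just an illustrative helper name). It pushes a random batch through a stack of ReLU layers and compares Var(w[l]) = 2/n with 1/n:

```python
import numpy as np

def relu_forward_variance(weight_std_fn, n=512, depth=30, batch=2000, seed=0):
    """Propagate a random batch through `depth` ReLU layers of width `n`
    and return the variance of the activations after each layer."""
    rng = np.random.default_rng(seed)
    y = rng.standard_normal((n, batch))
    variances = []
    for _ in range(depth):
        # weight_std_fn maps fan_in -> standard deviation of the weights.
        W = rng.standard_normal((n, n)) * weight_std_fn(n)
        y = np.maximum(0.0, W @ y)
        variances.append(y.var())
    return variances

he = relu_forward_variance(lambda fan_in: np.sqrt(2.0 / fan_in))
one_over_n = relu_forward_variance(lambda fan_in: np.sqrt(1.0 / fan_in))
print(he[-1])          # stays O(1): (1/2) * n * Var(w) = 1 at every layer
print(one_over_n[-1])  # shrinks roughly like (1/2)^depth
```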

Hope that made sense :)


Dear @nramon,

Your explanation is absolutely convincing.
Thank you so much for the link!

With best regards,
Galia


Thank you, I’m really glad I could help.

Keep up the great work :)