In Week 1 - Video (Weight Initialization for Deep Networks),

Hello,

From the given image,

Can anyone explain to me why, when n gets larger, w_i gets smaller?

Also, what is the reason for taking (or assuming) the variance of w to be 1/n?

Since the sigmoid(z) function has a limited output range (from 0.0 to 1.0), its gradients approach zero when |z| is large. So when you have lots of inputs to multiply by w and sum to get z, the weights must be learned to be small, to prevent z from getting into the range where the gradients are tiny.
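You can see this saturation numerically. A small sketch (not from the course materials, just an illustration) that evaluates the sigmoid derivative sigmoid(z)·(1 − sigmoid(z)) at increasing z:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# The gradient is largest at z = 0 and shrinks rapidly as |z| grows.
for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}  sigmoid'(z) = {sigmoid_grad(z):.6f}")
```

At z = 0 the gradient is 0.25, but by z = 10 it is already below 10^-4, so learning effectively stalls once z wanders that far out.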

Thank you @TMosh,

Ohhhh you're right, that makes sense.

- Consider a single neuron with n input connections. The output of this neuron is typically a weighted sum of the inputs plus a bias term.
- If the inputs have zero mean and unit variance, and the weights are initialized independently with zero mean and variance σ², the variance of the weighted sum (the neuron's output before the activation function) is nσ², because the variances of the n independent terms add.
- To maintain unit variance for the output, we set n·σ² = 1. Thus, σ² = 1/n.
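The steps above are easy to check empirically. A quick sketch (my own illustration, with an arbitrary n and sample count) comparing the variance of z under unit-variance weights versus 1/n-variance weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500          # number of inputs to the neuron
trials = 100_000 # samples used to estimate the variance of z

# Inputs: zero mean, unit variance.
x = rng.standard_normal((trials, n))

# Naive init: weight variance 1, so Var(z) grows like n.
w_naive = rng.standard_normal(n)
z_naive = x @ w_naive

# Scaled init: weight variance 1/n, so Var(z) stays near 1.
w_scaled = rng.standard_normal(n) * np.sqrt(1.0 / n)
z_scaled = x @ w_scaled

print(f"Var(z), sigma^2 = 1   : {z_naive.var():8.2f}  (close to n = {n})")
print(f"Var(z), sigma^2 = 1/n : {z_scaled.var():8.2f}  (close to 1)")
```

With σ² = 1/n the pre-activation z stays in the region where sigmoid still has a usable gradient, which connects this back to the saturation point above.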

This helps prevent the gradients from either vanishing or exploding as they propagate backward through the network.

*This is just my intuition, as I am also in the learning phase; for the details we need to read the research papers.*