Weight initialisation

In the Week 1 video (Weight Initialization for Deep Networks),

from the given image:

Can anyone explain to me why w_i gets smaller as n gets larger?
Also, what is the reason for taking (or assuming) the variance of w to be 1/n?

Since the sigmoid(z) function has a limited output range (from 0.0 to 1.0), its gradients approach zero when |z| is large. So when you have lots of inputs contributing to the sum z = Σ w_i·x_i, the weights must be learned to be small, to prevent z from getting into the range where the gradients are tiny.
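A quick numerical illustration of that saturation effect (a sketch in NumPy; the sample z values are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # derivative of sigmoid: s * (1 - s)
    s = sigmoid(z)
    return s * (1.0 - s)

# the gradient is largest at z = 0 (0.25) and shrinks fast as |z| grows
for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}  grad = {sigmoid_grad(z):.6f}")
```

At z = 10 the gradient is already on the order of 1e-5, so learning through that neuron effectively stalls.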


Thank you @TMosh,

Ohhhh you’re right, that makes sense.

  • Consider a single neuron with n input connections. The output of this neuron is typically a weighted sum of the inputs plus a bias term.
  • If the inputs have zero mean and unit variance, and the weights are initialized with zero mean and variance σ², then the variance of the weighted sum (the neuron's output before applying the activation function) is nσ².
  • To maintain unit variance for the output, we set n·σ² = 1, thus σ² = 1/n.
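The steps above are easy to check empirically (a sketch in NumPy; the layer width n and sample count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000          # number of inputs to the neuron (arbitrary)
samples = 50000   # Monte Carlo samples of the input vector

# inputs: zero mean, unit variance
x = rng.standard_normal((samples, n))

# naive init, Var(w) = 1  ->  Var(z) is roughly n
w_naive = rng.standard_normal(n)
z_naive = x @ w_naive
print("Var(z), Var(w)=1   :", z_naive.var())   # ~ n

# scaled init, Var(w) = 1/n  ->  Var(z) is roughly 1
w_scaled = rng.standard_normal(n) * np.sqrt(1.0 / n)
z_scaled = x @ w_scaled
print("Var(z), Var(w)=1/n :", z_scaled.var())  # ~ 1
```

With the 1/n scaling, z stays in the region where sigmoid's gradient is healthy, which is exactly the point of the initialization.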

This helps in preventing the gradients from either vanishing or exploding as they propagate backward through the network.

This is just my intuition, as I am also in the learning phase; for details we would need to read the research paper.
