In Week 1 - Video (Weight Initialization for Deep Networks),

Hello,

From the given image,

Can anyone explain to me why, when n gets larger, w_i gets smaller?

Also, what is the reason for taking (or assuming) the variance of w to be 1/n?

Since the sigmoid(z) function has a limited output range (from 0.0 to 1.0), its gradients approach zero when |z| is large. So when you have lots of inputs to multiply by w and sum to get z, the weights must be learned to be small, to prevent z from getting into the range where the gradients are tiny.
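You can see this saturation numerically. A small sketch (not from the course materials, just an illustration) that evaluates the sigmoid derivative sigmoid(z)·(1 − sigmoid(z)) at increasing z:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# The gradient is largest at z = 0 and shrinks rapidly as |z| grows.
for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}  sigmoid'(z) = {sigmoid_grad(z):.6f}")
```

At z = 0 the gradient is 0.25, but by z = 10 it is already below 10^-4, so learning effectively stalls once z wanders that far out.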

Thank you @TMosh,

Ohhhh you're right, that makes sense.

- Consider a single neuron with n input connections. The output of this neuron is typically a weighted sum of the inputs plus a bias term.
- If the inputs have zero mean and unit variance, and the weights are initialized independently with zero mean and variance σ², the variance of the weighted sum (the neuron's output before the activation function) is nσ², because the variances of the n independent terms add.
- To maintain unit variance for the output, we set n·σ² = 1. Thus, σ² = 1/n.
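The steps above are easy to check empirically. A quick sketch (my own illustration, with an arbitrary n and sample count) comparing the variance of z under unit-variance weights versus 1/n-variance weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500          # number of inputs to the neuron
trials = 100_000 # samples used to estimate the variance of z

# Inputs: zero mean, unit variance.
x = rng.standard_normal((trials, n))

# Naive init: weight variance 1, so Var(z) grows like n.
w_naive = rng.standard_normal(n)
z_naive = x @ w_naive

# Scaled init: weight variance 1/n, so Var(z) stays near 1.
w_scaled = rng.standard_normal(n) * np.sqrt(1.0 / n)
z_scaled = x @ w_scaled

print(f"Var(z), sigma^2 = 1   : {z_naive.var():8.2f}  (close to n = {n})")
print(f"Var(z), sigma^2 = 1/n : {z_scaled.var():8.2f}  (close to 1)")
```

With σ² = 1/n the pre-activation z stays in the region where sigmoid still has a usable gradient, which connects this back to the saturation point above.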

This helps prevent the gradients from either vanishing or exploding as they propagate backward through the network.

*This is just my intuition, as I am also in the learning phase; for the details we need to read the research papers.*