We learnt in these lectures -
- Course 1 - Week 3 - Random Initialization
  - don’t initialize all weights to 0 - that causes symmetry, and all nodes in a layer get updated identically
- Course 2 - Week 1 - Normalizing inputs
  - normalize inputs to have a mean of 0 and variance of 1 - i.e., if the data is normally distributed, 68% of it falls within 1 standard deviation of the mean (between -1 and +1)
- Course 2 - Week 1 - Weight Initialization
  - initialize weights close to 0 - with a standard deviation of sqrt(1/n) (for sigmoid) or sqrt(2/n) (for ReLU), where n is the number of inputs to the layer (see the sketch after this list)
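As a quick self-check of the last two points, here’s a minimal NumPy sketch (array shapes and variable names are mine, not from the course) that normalizes a toy input matrix to zero mean and unit variance per feature, and draws weights with standard deviation sqrt(1/n) or sqrt(2/n):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 5 features x 1000 examples (shapes are arbitrary, for illustration only).
X = rng.normal(loc=3.0, scale=2.0, size=(5, 1000))

# Normalize inputs: per feature, subtract the mean and divide by the standard
# deviation so that each feature has mean 0 and variance 1.
mu = X.mean(axis=1, keepdims=True)
sigma = X.std(axis=1, keepdims=True)
X_norm = (X - mu) / sigma

# Weight initialization: small random values with standard deviation sqrt(1/n)
# (Xavier-style, for sigmoid/tanh) or sqrt(2/n) (He-style, for ReLU), where n is
# the number of inputs feeding the layer. Biases can start at 0.
n_in, n_out = 5, 4
W_xavier = rng.normal(size=(n_out, n_in)) * np.sqrt(1.0 / n_in)
W_he = rng.normal(size=(n_out, n_in)) * np.sqrt(2.0 / n_in)
b = np.zeros((n_out, 1))

print(X_norm.mean(axis=1).round(3), X_norm.var(axis=1).round(3))  # ~0s and ~1s
print(W_he.std().round(3), round(np.sqrt(2.0 / n_in), 3))         # sample std vs. target sqrt(2/n)
```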
Questions -
- Why don’t small weights cause the solution to become symmetric, with at least some nodes ending up with equal weights? I can see in general that the network will push different neurons in different directions to reduce the cost, but why wouldn’t it just settle into some symmetric local minimum? What’s the intuitive explanation?
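For the first question, here’s a toy experiment (a made-up 2-hidden-unit sigmoid network, not course code) that at least shows the symmetry mechanism numerically: with all-zero W1 the two hidden units get identical gradients at every step and stay identical, whereas tiny random weights give them different gradients, so they drift apart. It doesn’t answer why gradient descent avoids symmetric local minima in general, but it shows that the zero-init symmetry is never broken by training itself:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(W1, X, y, lr=0.1, steps=100):
    """One-hidden-layer net with 2 hidden units; returns W1 after gradient descent."""
    W2 = np.ones((1, 2)) * 0.5           # output weights start equal in both runs,
    b1, b2 = np.zeros((2, 1)), np.zeros((1, 1))  # so they don't break the tie themselves
    m = X.shape[1]
    for _ in range(steps):
        A1 = sigmoid(W1 @ X + b1)
        A2 = sigmoid(W2 @ A1 + b2)
        dZ2 = A2 - y                      # gradient of cross-entropy wrt Z2
        dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)
        W2 -= lr * (dZ2 @ A1.T) / m
        W1 -= lr * (dZ1 @ X.T) / m
        b2 -= lr * dZ2.mean(axis=1, keepdims=True)
        b1 -= lr * dZ1.mean(axis=1, keepdims=True)
    return W1

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 200))
y = (X.sum(axis=0, keepdims=True) > 0).astype(float)

W_zero = train(np.zeros((2, 3)), X, y)
W_rand = train(rng.normal(size=(2, 3)) * 0.01, X, y)
print(W_zero)   # both rows identical: the two hidden units never differentiate
print(W_rand)   # rows differ: small random init breaks the symmetry
```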
- On a related note, shouldn’t there be a term in the cost (somewhat like L2 regularization, but a different function of W) that keeps the weights away from each other to break symmetry? Does that make sense, and has it been done?
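Just to make the idea concrete, here’s a hypothetical “repulsion” penalty I sketched (my own construction, not from the course and not a standard regularizer I know of): it subtracts the mean pairwise squared distance between rows of W, so adding it to the cost would push the hidden units’ weight vectors apart; in practice it would presumably need an L2 term alongside it so the weights can’t grow without bound:

```python
import numpy as np

def repulsion_penalty(W, lam=1e-3):
    """Hypothetical cost term that rewards hidden units (rows of W) for having
    different weight vectors: minus the mean pairwise squared distance.
    Adding this to the cost pushes rows of W apart; it would need an L2 term
    alongside it so the weights can't grow without bound."""
    diffs = W[:, None, :] - W[None, :, :]           # (n, n, d) pairwise differences
    sq_dists = (diffs ** 2).sum(axis=-1)            # (n, n) squared distances
    n = W.shape[0]
    return -lam * sq_dists.sum() / (n * (n - 1))    # negative: closeness is penalized

rng = np.random.default_rng(0)
W_close = rng.normal(size=(4, 3)) * 0.001   # rows all near 0, hence near each other -> penalty ~ 0
W_spread = rng.normal(size=(4, 3)) * 1.0    # well-separated rows -> more negative cost
print(repulsion_penalty(W_close), repulsion_penalty(W_spread))
```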
- For the vanishing gradient problem, isn’t the point that the derivative of the activation function evaluated at z should not be too small? That derivative is the multiplier applied to the upstream gradient during backward propagation. Doesn’t it then follow that z should be neither too high nor too low (and not necessarily close to 1.0)? The intuition for z being close to 1.0 was based on forward propagation through a simplified network - why/how does that carry over to the backward propagation of derivatives?
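Here’s a small check of the multiplier point for sigmoid (my own sketch): sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) peaks at 0.25 at z = 0 and decays quickly as |z| grows, so the per-layer backprop factor is small for both very negative and very positive z:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # derivative of the sigmoid, the per-layer backprop multiplier

for z in [-10.0, -4.0, -1.0, 0.0, 1.0, 4.0, 10.0]:
    print(f"z = {z:6.1f}   sigmoid'(z) = {sigmoid_prime(z):.5f}")

# Even at its maximum (z = 0) the multiplier is only 0.25, so a deep stack of
# sigmoid layers shrinks gradients by at most a factor of 0.25 per layer --
# and far faster when |z| is large.
```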
- With roughly 68% of the weights being within sqrt(1/n) of 0 (variance = 1/n, so standard deviation = sqrt(1/n), assuming a normal distribution) and roughly 68% of the n inputs also being between -1 and +1 (after input normalization) - why would the computed “z” be close to 1.0, as mentioned in the lecture?
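Here’s a quick simulation of my reading of that claim (my own sketch): with n standard-normal inputs and weights of variance 1/n, z = sum_i w_i * x_i ends up with variance roughly 1, so |z| is typically on the order of 1 rather than z being literally close to 1.0:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                 # number of inputs to one neuron
trials = 100_000

x = rng.normal(size=(trials, n))                        # inputs ~ N(0, 1)
w = rng.normal(size=(trials, n)) * np.sqrt(1.0 / n)     # weights with variance 1/n
z = (w * x).sum(axis=1)                                 # z = sum_i w_i * x_i

# Var(z) = n * Var(w_i) * Var(x_i) = n * (1/n) * 1 = 1, so z is typically
# of order 1 in magnitude, but not literally equal to 1.0.
print(z.mean(), z.var(), np.abs(z).mean())
```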