Intuition on weight initialization

We learnt in these lectures -

  1. Course 1 - Week 3 - Random Initialization
  • don’t initialize all weights to 0 - that causes symmetry and all nodes get updated identically
  2. Course 2 - Week 1 - Normalizing inputs
  • normalize inputs to have a mean of 0 and variance of 1 - i.e., if it’s a normal distribution, 68% of the data lies within ±1 (one standard deviation)
  3. Course 2 - Week 1 - Weight Initialization
  • initialize weights close to 0 - with a standard deviation of sqrt(1/n) (for sigmoid/tanh) or sqrt(2/n) (for ReLU); see the sketch after this list
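
As a concrete reference for point 3, here is a minimal numpy sketch of those two scalings (the layer sizes, seed, and function name are just illustrative, not from the course notebooks):

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed

def init_layer(n_in, n_out, activation="relu"):
    """Small random weights: std sqrt(1/n_in) for sigmoid/tanh, sqrt(2/n_in) for ReLU."""
    scale = np.sqrt(2.0 / n_in) if activation == "relu" else np.sqrt(1.0 / n_in)
    W = rng.standard_normal((n_out, n_in)) * scale  # mean 0, std = scale
    b = np.zeros((n_out, 1))                        # biases can simply start at 0
    return W, b

W1, b1 = init_layer(n_in=64, n_out=16, activation="relu")
print(W1.std())  # roughly sqrt(2/64) ≈ 0.18
```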

Questions -

  • Why don’t small random weights cause the solution to become symmetric, with at least some nodes ending up with equal weights? I can see, in general, that the network will incentivize different neurons differently to reach the minimum cost, but why wouldn’t it just settle into some symmetric local minimum? What’s the intuitive explanation?

  • On a related note, shouldn’t there be something in the cost (kind of like L2 regularization, except a different function of W) that keeps the weights away from each other to break symmetry? Does that make sense, and has it been done?

  • For the vanishing gradient problem, isn’t the real requirement that the derivative of the activation function evaluated at z should not be too small, since that’s the multiplier we apply during backward propagation? Doesn’t it then follow that the value of z should be neither too high nor too low (and not necessarily close to 1.0)? The intuition for z being close to 1.0 was based on forward propagation through a simplified network - why/how does that carry over to backward propagation of the derivatives?

  • With roughly 68% of the weights being within ±sqrt(1/n) (variance = 1/n, assuming a normal distribution) and roughly 68% of the n normalized inputs being between -1 and +1, why would the computed “z” be close to 1.0, as mentioned in the lecture? (I sketch the arithmetic I have in mind right after this list.)
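
To make the arithmetic in that last question concrete, this is roughly the computation I have in mind (a minimal numpy check with made-up sizes, assuming standard-normal inputs and weights drawn with variance 1/n):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000                                     # inputs feeding a single neuron
x = rng.standard_normal((n, 10_000))         # normalized inputs: mean 0, variance 1
w = rng.standard_normal(n) * np.sqrt(1 / n)  # weights with variance 1/n
z = w @ x                                    # pre-activation for 10,000 examples
print(z.mean(), z.var())                     # mean ≈ 0, variance ≈ 1, so |z| is of order 1
```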

For questions 1) and 2), note that “not equal” is “not equal”, right? Even a tiny difference counts. We’re working in 64-bit floating point here, which is pathetic compared to the pure math beauty of \mathbb{R}, but you still have quite a bit of resolution. The resolution of just the mantissa in a binary64 float (see the IEEE 754 spec) is on the order of 10^{-16}, and the exponent range is roughly -307 to +308 (base 10). So all we care about is that the weights start as different values; the learning will then produce unique values. This doesn’t prove that every individual weight will be different, but all we really need is for each neuron’s whole weight vector to be unique.
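
Here is a tiny numpy experiment in that spirit (a sketch with made-up sizes, not code from the course): initialize one hidden layer with zeros versus small random values, take a single gradient step, and compare the rows of W1.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 50))               # 3 features, 50 examples
Y = (rng.random((1, 50)) > 0.5).astype(float)  # arbitrary binary labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_step(W1, W2, lr=0.1):
    """One forward/backward pass of a 2-layer sigmoid net (bias updates omitted)."""
    m = X.shape[1]
    A1 = sigmoid(W1 @ X)                    # hidden activations
    A2 = sigmoid(W2 @ A1)                   # output
    dZ2 = A2 - Y                            # sigmoid + cross-entropy gradient
    dW2 = dZ2 @ A1.T / m
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)
    dW1 = dZ1 @ X.T / m
    return W1 - lr * dW1, W2 - lr * dW2

# Zero initialization: dZ1 is identically 0 (because W2 is 0), so the hidden rows never separate.
W1, W2 = one_step(np.zeros((2, 3)), np.zeros((1, 2)))
print(np.allclose(W1[0], W1[1]))            # True - still symmetric

# Small random initialization: the two rows already differ after one step.
W1, W2 = one_step(rng.standard_normal((2, 3)) * 0.01, rng.standard_normal((1, 2)) * 0.01)
print(np.allclose(W1[0], W1[1]))            # False - symmetry broken
```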

For question 3), I think you are confusing the value of z with the derivative of the activation function at that point. It all depends on what the activation function is, right? If it is a function like tanh or sigmoid, which has “flat tails”, then large absolute values of z will have the vanishing gradient problem. With ReLU, it is only negative values of z that have that problem.
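
A quick way to see the “flat tails” point (a minimal sketch): evaluate the activation derivatives at a few values of z.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
d_sigmoid = sigmoid(z) * (1 - sigmoid(z))  # peaks at 0.25 when z = 0, nearly 0 in both tails
d_tanh = 1 - np.tanh(z) ** 2               # peaks at 1 when z = 0, nearly 0 in both tails
d_relu = (z > 0).astype(float)             # 0 for z <= 0, exactly 1 for z > 0

print(d_sigmoid)  # vanishes for large |z|
print(d_tanh)     # vanishes for large |z|
print(d_relu)     # never shrinks for positive z, but is dead for negative z
```

These derivatives are the per-layer multipliers that get chained together during backward propagation, which is why repeatedly landing in the flat tails shrinks the gradient.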

For question 4), I’ll have to go back and watch this lecture again. It’s been a while …