Let’s compile a list of the “parameter initializations” that we may encounter during the course or in the exercises. It turns out that the naming is rather confusing:
Parameter Initializations
The heuristic prevalent when Xavier Glorot and Yoshua Bengio wrote their paper in 2010
The Glorot/Bengio paper (see the references) describes this as a “commonly used heuristic” (i.e. commonly used in 2010), but notes that it gives “bad convergence”.
Choose weights for layer l by sampling the uniform distribution over the interval:
\left[- \frac{1}{\sqrt{n_{l-1}}}, + \frac{1}{\sqrt{n_{l-1}}} \right]
n_{l-1} is the width of the preceding layer. Weights may be scaled.
The standard deviation of the resulting (uniform) distribution is \sigma = \frac{1}{\sqrt{3}} \frac{1}{\sqrt{n_{l-1}}} \approx 0.57735 \frac{1}{\sqrt{n_{l-1}}}
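As a quick sanity check of that value, here is a minimal NumPy sketch (the width n_{l-1} = 100 is an arbitrary example, not from the course):

```python
import numpy as np

n_prev = 100                              # example width n_{l-1}
limit = 1.0 / np.sqrt(n_prev)             # half-width of the uniform interval

analytic_std = limit / np.sqrt(3)         # std of U(-a, +a) is a / sqrt(3)
empirical_std = np.random.uniform(-limit, limit, size=1_000_000).std()

print(analytic_std)   # ~0.0577 = 0.57735 / sqrt(100)
print(empirical_std)  # close to the analytic value
```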
In the Keras library, one can implement this distribution with the uniform initializer tf.keras.initializers.RandomUniform, setting minval and maxval appropriately. Alternatively, Keras provides the uniform He initializer, tf.keras.initializers.HeUniform, which samples the same kind of interval, just scaled by \sqrt{6}, i.e. \pm \frac{\sqrt{6}}{\sqrt{n_{l-1}}} (confusing naming).
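A minimal sketch of how this heuristic could be set up with RandomUniform (the layer widths 256 and 128 are arbitrary examples; the limit must match the actual fan-in of the layer):

```python
import numpy as np
import tensorflow as tf

n_prev = 256                              # width of the preceding layer, n_{l-1}
limit = 1.0 / np.sqrt(n_prev)             # half-width of the uniform interval

old_heuristic = tf.keras.initializers.RandomUniform(minval=-limit, maxval=limit)

inputs = tf.keras.Input(shape=(n_prev,))
outputs = tf.keras.layers.Dense(
    128, activation="tanh", kernel_initializer=old_heuristic
)(inputs)
model = tf.keras.Model(inputs, outputs)
```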
The uniform Xavier (or Glorot) initialization
The Glorot/Bengio paper proposes the formula below as “normalized initialization” to maintain activation variances and back-propagated gradients variance for tanh and softsign activation functions.
This is called either Xavier Initialization or Glorot Initialization.
Choose weights for layer l by sampling the uniform distribution over the interval:
\left[- \frac{\sqrt{6}}{\sqrt{n_{l-1}+n_{l}}}, + \frac{\sqrt{6}}{\sqrt{n_{l-1}+n_{l}}} \right]
n_{l-1} is the width of the preceding layer. n_{l} is the width of the current layer. Weights may be scaled.
The standard deviation of the resulting (uniform) distribution is \sigma = \sqrt{2} \frac{1}{\sqrt{n_{l-1}+n_{l}}} \approx 1.41421 \frac{1}{\sqrt{n_{l-1}+n_{l}}}
In the Keras library, this approach is implemented in the uniform Glorot initializer, tf.keras.initializers.GlorotUniform
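A small sketch that uses GlorotUniform and checks the sampled weights against the interval and standard deviation above (fan-in 300 and fan-out 100 are arbitrary example widths):

```python
import numpy as np
import tensorflow as tf

fan_in, fan_out = 300, 100
limit = np.sqrt(6.0 / (fan_in + fan_out))          # sqrt(6) / sqrt(n_{l-1} + n_l)

init = tf.keras.initializers.GlorotUniform(seed=0)
w = init(shape=(fan_in, fan_out)).numpy()

print(limit, w.min(), w.max())                     # samples stay within +/- limit
print(w.std(), np.sqrt(2.0 / (fan_in + fan_out)))  # empirical vs. analytic std
```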
The normal Xavier (or Glorot) initialization
This initialization is not given in the Glorot/Bengio paper, but it is a straightforward variation. It is the one given in the course as Xavier Initialization in “Improving Deep Neural Networks, Week 1, Weight Initialization for Deep Networks”.
Choose weights for layer l by sampling from the standard normal distribution and then multiplying by:
\sqrt{\frac{1}{n_{l-1}}} or \sqrt{\frac{2}{n_{l-1}+n_{l}}}
which is the same as sampling from the normal distribution with mean zero and the above standard deviation.
n_{l-1} is the width of the preceding layer. n_{l} is the width of the current layer. Weights may be scaled.
In the Keras library:
- the first approach, depending only on fan-in, is implemented in the normal LeCun initializer: tf.keras.initializers.LecunNormal
- the second approach, depending on the sum of fan-in and fan-out, is implemented in the normal Glorot initializer: tf.keras.initializers.GlorotNormal
Both actually sample from a truncated Normal Distribution, where “values more than two standard deviations from the mean are discarded and re-drawn.”
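A sketch of both normal variants applied to Dense layers (the widths 784, 256 and 10 are arbitrary illustrations, not taken from the course):

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(784,))

# fan-in only: stddev = sqrt(1 / n_{l-1}) (LeCun normal, truncated under the hood)
x = tf.keras.layers.Dense(
    256, activation="tanh",
    kernel_initializer=tf.keras.initializers.LecunNormal(seed=0),
)(inputs)

# fan-in and fan-out: stddev = sqrt(2 / (n_{l-1} + n_l)) (Glorot normal, also truncated)
outputs = tf.keras.layers.Dense(
    10, activation="softmax",
    kernel_initializer=tf.keras.initializers.GlorotNormal(seed=0),
)(x)

model = tf.keras.Model(inputs, outputs)
```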
He Initialization
He, Zhang, Ren, and Sun propose the following as “equation 14” of their 2015 paper. It is used in particular for ReLU activations and is also the initialization used in the course exercise.
Choose weights for layer l by sampling from the standard normal distribution and then multiplying by:
\sqrt{\frac{2}{n_{l-1}}}
i.e. sampling from a normal distribution with mean zero and the above standard deviation.
In the Keras library, the above is implemented by tf.keras.initializers.HeNormal, which again samples from a truncated Normal Distribution.
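A sketch of He initialization for a ReLU layer, plus a quick check of the scale (the fan-in of 512 and width of 256 are arbitrary examples):

```python
import numpy as np
import tensorflow as tf

fan_in, units = 512, 256

he = tf.keras.initializers.HeNormal(seed=0)
w = he(shape=(fan_in, units)).numpy()

print(np.sqrt(2.0 / fan_in))   # target stddev from "equation 14"
print(w.std())                 # empirical stddev (drawn from a truncated normal,
                               # so it tracks the target scale)

layer = tf.keras.layers.Dense(units, activation="relu", kernel_initializer=he)
```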
References
Understanding the difficulty of training deep feedforward neural networks
Xavier Glorot, Yoshua Bengio
2010-01
Appears in: “Proceedings of the 13th International Conference on Artificial Intelligence and Statistics”
This paper introduces “Xavier Initialization” to handle exploding/vanishing gradients.
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
2015-06
This paper introduces “He Initialization”.