A little note on parameter initializations (Glorot, Xavier, He)

Let’s compile a list of the “parameter initializations” that we may encounter during the course or the exercises. It turns out that the naming is rather confusing:

Parameter Initializations

The heuristic prevalent when Xavier Glorot and Yoshua Bengio wrote their paper in 2010

The Glorot/Bengio paper (see the references) describes this as a “commonly used heuristic” (i.e. as of 2010), but notes that it gives “bad convergence”.

Choose weights for layer l by sampling the uniform distribution over the interval:

\left[- \frac{1}{\sqrt{n_{l-1}}}, + \frac{1}{\sqrt{n_{l-1}}} \right]

n_{l-1} is the width of the preceding layer. Weights may be scaled.

The standard deviation of the resulting (uniform) distribution is \sigma = \frac{1}{\sqrt{3}} \frac{1}{\sqrt{n_{l-1}}} \approx 0.57735 \frac{1}{\sqrt{n_{l-1}}}
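As a quick numerical check (a minimal NumPy sketch; the layer width of 100 is an arbitrary example, not from the paper), the empirical standard deviation of such uniform samples matches the formula above:

```python
import numpy as np

n_prev = 100                          # width of the preceding layer (arbitrary example value)
a = 1.0 / np.sqrt(n_prev)             # half-width of the interval [-a, +a]

w = np.random.uniform(-a, a, size=1_000_000)
print(w.std())                        # empirically close to the theoretical value below
print(a / np.sqrt(3))                 # a / sqrt(3) ~ 0.0577 for n_prev = 100
```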

In the Keras library, one can use the uniform initializer tf.keras.initializers.RandomUniform to implement this distribution, setting the minval and maxval appropriately.

Note that the similarly named uniform He initializer, tf.keras.initializers.HeUniform, does not reproduce this heuristic: it samples from a wider interval with limit \sqrt{6 / n_{l-1}} (the naming is easy to confuse with the plain heuristic above).
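A minimal Keras sketch of this heuristic using RandomUniform as mentioned above (the layer widths are arbitrary example values, not from the paper):

```python
import numpy as np
import tensorflow as tf

n_prev, n_curr = 784, 128            # example widths of the preceding and current layer

limit = 1.0 / np.sqrt(n_prev)        # half-width of the heuristic interval
layer = tf.keras.layers.Dense(
    n_curr,
    activation="tanh",
    kernel_initializer=tf.keras.initializers.RandomUniform(minval=-limit, maxval=limit),
)
```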

The uniform Xavier (or Glorot) initialization

The Glorot/Bengio paper proposes the formula below as “normalized initialization”, designed to maintain the variances of activations and of back-propagated gradients across layers for the tanh and softsign activation functions.

This is called either Xavier Initialization or Glorot Initialization.

Choose weights for layer l by sampling the uniform distribution over the interval:

\left[- \frac{\sqrt{6}}{\sqrt{n_{l-1}+n_{l}}}, + \frac{\sqrt{6}}{\sqrt{n_{l-1}+n_{l}}} \right]

n_{l-1} is the width of the preceding layer. n_{l} is the width of the current layer. Weights may be scaled.

The standard deviation of the resulting (uniform) distribution is \sigma = \sqrt{2} \frac{1}{\sqrt{n_{l-1}+n_{l}}} \approx 1.41421 \frac{1}{\sqrt{n_{l-1}+n_{l}}}

In the Keras library, this approach is implemented in the uniform Glorot initializer, tf.keras.initializers.GlorotUniform.
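A minimal sketch (the layer widths and the weight shape are example values chosen for illustration); the initializer reads fan-in and fan-out from the weight shape, so no limits need to be supplied by hand:

```python
import tensorflow as tf

layer = tf.keras.layers.Dense(
    64,                                          # n_l, example width
    activation="tanh",
    kernel_initializer=tf.keras.initializers.GlorotUniform(),
)

# Or draw weights directly for a given shape (fan_in = 256, fan_out = 64 here):
w = tf.keras.initializers.GlorotUniform()(shape=(256, 64))
print(float(tf.math.reduce_std(w)))              # roughly sqrt(2 / (256 + 64)) ~ 0.079
```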

The normal Xavier (or Glorot) initialization

This initialization is not given in the Glorot/Bengio paper, but it is a straightforward variation. It is the one given in the course as Xavier Initialization in “Improving Deep Neural Networks, Week 1, Weight Initialization for Deep Networks”.

Choose weights for layer l by sampling from the standard normal distribution and then multiplying by:

\sqrt{\frac{1}{n_{l-1}}} or \sqrt{\frac{2}{n_{l-1}+n_{l}}}

which is the same as sampling from the normal distribution with mean zero and the above standard deviation.

n_{l-1} is the width of the preceding layer. n_{l} is the width of the current layer. Weights may be scaled.

In the Keras library, the first scaling corresponds to tf.keras.initializers.LecunNormal and the second to the normal Glorot initializer, tf.keras.initializers.GlorotNormal.

Both actually sample from a truncated Normal Distribution, where “values more than two standard deviations from the mean are discarded and re-drawn.”
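A minimal sketch under these assumptions (example layer widths; the by-hand line mimics the course style of scaling a standard-normal draw):

```python
import numpy as np
import tensorflow as tf

n_prev, n_curr = 256, 64                                 # example layer widths

# By hand: standard normal scaled to the desired standard deviation
W = np.random.randn(n_curr, n_prev) * np.sqrt(1.0 / n_prev)

# Keras built-ins (both sample a truncated normal under the hood):
lecun_normal  = tf.keras.initializers.LecunNormal()      # stddev = sqrt(1 / fan_in)
xavier_normal = tf.keras.initializers.GlorotNormal()     # stddev = sqrt(2 / (fan_in + fan_out))

layer = tf.keras.layers.Dense(n_curr, activation="tanh",
                              kernel_initializer=xavier_normal)
```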

He Initialization

He, Zhang, Ren and Sun propose the following in their 2015 paper (their “equation 14”). It is used in particular for ReLU activations, and it is the initialization used in the course exercise.

Choose weights for layer l by sampling from the standard normal distribution and then multiplying by:

\sqrt{\frac{2}{n_{l-1}}}

i.e. sample from a normal distribution with mean zero and the above standard deviation.

In the Keras library, the above is implemented by tf.keras.initializers.HeNormal, which again samples from a truncated Normal Distribution.
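A minimal sketch for a ReLU layer (example widths; the by-hand line is the same scaling used in the course exercise):

```python
import numpy as np
import tensorflow as tf

n_prev, n_curr = 256, 64                                 # example layer widths

# By hand: standard normal scaled by sqrt(2 / n_prev)
W = np.random.randn(n_curr, n_prev) * np.sqrt(2.0 / n_prev)

# Keras built-in (truncated normal with stddev = sqrt(2 / fan_in)):
layer = tf.keras.layers.Dense(
    n_curr,
    activation="relu",
    kernel_initializer=tf.keras.initializers.HeNormal(),
)
```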

References

Understanding the difficulty of training deep feedforward neural networks

Xavier Glorot, Yoshua Bengio

2010-01

Appears in: “Proceedings of the 13th International Conference on Artificial Intelligence and Statistics”

This paper introduces “Xavier Initialization” (the “normalized initialization”) and studies how activation and gradient variances behave across layers, i.e. the vanishing/exploding gradient problem.

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun

2015-06

This paper introduces “He Initialization”.
