A little note on parameter initializations (Glorot, Xavier, He)

Let’s compile a list of the “parameter initializations” that we may encounter during the course or the exercises. It turns out that the naming is rather confusing:

Parameter Initializations

The heuristic prevalent when Xavier Glorot and Yoshua Bengio wrote their paper in 2010

The Glorot/Bengio paper (see the references) describes this as a “commonly used heuristic” (i.e. as of 2010), but notes that it gives “bad convergence”.

Choose weights for layer l by sampling the uniform distribution over the interval:

\left[- \frac{1}{\sqrt{n_{l-1}}}, + \frac{1}{\sqrt{n_{l-1}}} \right]

n_{l-1} is the width of the preceding layer. Weights may be scaled.

The standard deviation of the resulting (uniform) distribution is \sigma = \frac{1}{\sqrt{3}} \frac{1}{\sqrt{n_{l-1}}} \approx 0.57735 \frac{1}{\sqrt{n_{l-1}}}
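As a quick numerical check (a minimal NumPy sketch; the layer width of 100 is an arbitrary example, not from the paper), the empirical standard deviation of such uniform samples matches the formula above:

```python
import numpy as np

n_prev = 100                          # width of the preceding layer (arbitrary example value)
a = 1.0 / np.sqrt(n_prev)             # half-width of the interval [-a, +a]

w = np.random.uniform(-a, a, size=1_000_000)
print(w.std())                        # empirically close to the theoretical value below
print(a / np.sqrt(3))                 # a / sqrt(3) ~ 0.0577 for n_prev = 100
```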

In the Keras library, one can use the uniform initializer tf.keras.initializers.RandomUniform to implement this distribution, setting the minval and maxval appropriately.

Note that the similarly named uniform He initializer, tf.keras.initializers.HeUniform, does not reproduce this heuristic: it samples from a wider interval with limit \sqrt{6 / n_{l-1}} (the naming is easy to confuse with the plain heuristic above).
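A minimal Keras sketch of this heuristic using RandomUniform as mentioned above (the layer widths are arbitrary example values, not from the paper):

```python
import numpy as np
import tensorflow as tf

n_prev, n_curr = 784, 128            # example widths of the preceding and current layer

limit = 1.0 / np.sqrt(n_prev)        # half-width of the heuristic interval
layer = tf.keras.layers.Dense(
    n_curr,
    activation="tanh",
    kernel_initializer=tf.keras.initializers.RandomUniform(minval=-limit, maxval=limit),
)
```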

The uniform Xavier (or Glorot) initialization

The Glorot/Bengio paper proposes the formula below as “normalized initialization”, designed to maintain the variances of activations and of back-propagated gradients across layers for the tanh and softsign activation functions.

This is called either Xavier Initialization or Glorot Initialization.

Choose weights for layer l by sampling the uniform distribution over the interval:

\left[- \frac{\sqrt{6}}{\sqrt{n_{l-1}+n_{l}}}, + \frac{\sqrt{6}}{\sqrt{n_{l-1}+n_{l}}} \right]

n_{l-1} is the width of the preceding layer. n_{l} is the width of the current layer. Weights may be scaled.

The standard deviation of the resulting (uniform) distribution is \sigma = \sqrt{2} \frac{1}{\sqrt{n_{l-1}+n_{l}}} \approx 1.41421 \frac{1}{\sqrt{n_{l-1}+n_{l}}}

In the Keras library, this approach is implemented in the uniform Glorot initializer, tf.keras.initializers.GlorotUniform.
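A minimal sketch (the layer widths and the weight shape are example values chosen for illustration); the initializer reads fan-in and fan-out from the weight shape, so no limits need to be supplied by hand:

```python
import tensorflow as tf

layer = tf.keras.layers.Dense(
    64,                                          # n_l, example width
    activation="tanh",
    kernel_initializer=tf.keras.initializers.GlorotUniform(),
)

# Or draw weights directly for a given shape (fan_in = 256, fan_out = 64 here):
w = tf.keras.initializers.GlorotUniform()(shape=(256, 64))
print(float(tf.math.reduce_std(w)))              # roughly sqrt(2 / (256 + 64)) ~ 0.079
```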

The normal Xavier (or Glorot) initialization

This initialization is not given in the Glorot/Bengio paper, but it is a straightforward variation. It is the one given in the course as Xavier Initialization in “Improving Deep Neural Networks, Week 1, Weight Initialization for Deep Networks”.

Choose weights for layer l by sampling from the standard normal distribution and then multiplying by:

\sqrt{\frac{1}{n_{l-1}}} or \sqrt{\frac{2}{n_{l-1}+n_{l}}}

which is the same as sampling from the normal distribution with mean zero and the above standard deviation.

n_{l-1} is the width of the preceding layer. n_{l} is the width of the current layer. Weights may be scaled.

In the Keras library, the first scaling corresponds to tf.keras.initializers.LecunNormal and the second to the normal Glorot initializer, tf.keras.initializers.GlorotNormal.

Both actually sample from a truncated Normal Distribution, where “values more than two standard deviations from the mean are discarded and re-drawn.”
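A minimal sketch under these assumptions (example layer widths; the by-hand line mimics the course style of scaling a standard-normal draw):

```python
import numpy as np
import tensorflow as tf

n_prev, n_curr = 256, 64                                 # example layer widths

# By hand: standard normal scaled to the desired standard deviation
W = np.random.randn(n_curr, n_prev) * np.sqrt(1.0 / n_prev)

# Keras built-ins (both sample a truncated normal under the hood):
lecun_normal  = tf.keras.initializers.LecunNormal()      # stddev = sqrt(1 / fan_in)
xavier_normal = tf.keras.initializers.GlorotNormal()     # stddev = sqrt(2 / (fan_in + fan_out))

layer = tf.keras.layers.Dense(n_curr, activation="tanh",
                              kernel_initializer=xavier_normal)
```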

He Initialization

He, Zhang, Ren and Sun propose the following in their 2015 paper (their “equation 14”). It is used in particular for ReLU activations, and it is the initialization used in the course exercise.

Choose weights for layer l by sampling from the standard normal distribution and then multiplying by:

\sqrt{\frac{2}{n_{l-1}}}

i.e. sample from a normal distribution with mean zero and the above standard deviation.

In the Keras library, the above is implemented by tf.keras.initializers.HeNormal, which again samples from a truncated Normal Distribution.
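A minimal sketch for a ReLU layer (example widths; the by-hand line is the same scaling used in the course exercise):

```python
import numpy as np
import tensorflow as tf

n_prev, n_curr = 256, 64                                 # example layer widths

# By hand: standard normal scaled by sqrt(2 / n_prev)
W = np.random.randn(n_curr, n_prev) * np.sqrt(2.0 / n_prev)

# Keras built-in (truncated normal with stddev = sqrt(2 / fan_in)):
layer = tf.keras.layers.Dense(
    n_curr,
    activation="relu",
    kernel_initializer=tf.keras.initializers.HeNormal(),
)
```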

References

Understanding the difficulty of training deep feedforward neural networks

Xavier Glorot, Yoshua Bengio

2010-01

Appears in: “Proceedings of the 13th International Conference on Artificial Intelligence and Statistics”

This paper introduces “Xavier Initialization” (the “normalized initialization”) and studies how activation and gradient variances behave across layers, i.e. the vanishing/exploding gradient problem.

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun

2015-06

This paper introduces “He Initialization”.
