Parameter initialization question

Hi, I understand that when initializing W, multiplying by np.sqrt(1/n^{[l-1]}) can prevent exploding gradients since it makes W smaller. But I am confused about how this would prevent vanishing gradients. If w_i is already smaller than 1, multiplying it by np.sqrt(1/n^{[l-1]}) could make it even smaller, so wouldn't this make the vanishing gradient problem worse?

Hi @Lily007

welcome to the community and thanks for your first question!

The purpose of weight matrix initialisation is rather to keep the initialized weights in a reasonable range from the start. It is this comparable scale across layers that we want to achieve, so that we are well prepared to go ahead with the optimization and tune the weights to solve the problem.

Specifically:

The goal of Xavier Initialization is to initialize the weights such that the variance of the activations is the same across every layer. This constant variance helps prevent the gradients from exploding or vanishing. Note that the scaling factor does not simply make everything smaller: each unit in layer l sums n^{[l-1]} weighted inputs, so the variance of that sum grows with n^{[l-1]}, and dividing by sqrt(n^{[l-1]}) compensates for exactly that growth. The signal therefore neither shrinks nor blows up as it passes through the layers (see the small sketch below).

see also: Section 4 (Week 4)
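
To illustrate this, here is a minimal NumPy sketch (my own toy example, not course code; the layer width, depth and random inputs are assumptions chosen just for illustration). It forward-propagates random data through a stack of tanh layers, once with plain standard-normal weights and once with weights scaled by np.sqrt(1/n^{[l-1]}), and prints the standard deviation of the pre-activations z plus the fraction of saturated units per layer:

```python
import numpy as np

np.random.seed(0)

n_units = 500      # assumed layer width, purely illustrative
n_layers = 10      # assumed depth, purely illustrative
x = np.random.randn(n_units, 1000)   # 1000 random inputs with unit variance

def propagate(scale):
    """Forward pass through tanh layers; report std of z and saturation rate."""
    a = x
    for l in range(1, n_layers + 1):
        W = np.random.randn(n_units, n_units) * scale
        z = W @ a                              # linear step
        a = np.tanh(z)                         # activation
        saturated = np.mean(np.abs(a) > 0.99)  # fraction of saturated tanh units
        print(f"layer {l:2d}: std(z) = {z.std():8.3f}, saturated = {saturated:6.1%}")

print("W ~ N(0, 1), no scaling:")
propagate(1.0)

print("\nW scaled by np.sqrt(1/n):")
propagate(np.sqrt(1.0 / n_units))
```

With the unscaled weights, std(z) jumps to roughly sqrt(n_units) and almost all tanh units saturate right away; with the scaled weights, the pre-activations stay on the order of 1 through all layers, which is the "comparable range" mentioned above.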

Assuming that this reasonable range is given, my experience is that the vanishing gradient problem is more often caused by the activation functions:

Let’s take sigmoid or tanh. Here you have this risk because they saturate / flatten out at the tails, which makes the gradient “vanish”. For ReLU, by contrast, there is a reduced risk of vanishing gradients since the gradient in the positive section of the ReLU function is constant: it does not saturate, unlike sigmoid or tanh (a small numerical comparison follows below), see also: Activation functions
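
To make the saturation point concrete, here is a small sketch (my own example; the chosen z values are arbitrary) comparing the derivatives of sigmoid, tanh and ReLU at a few pre-activation values. For large |z| the sigmoid and tanh derivatives collapse towards 0, while the ReLU derivative stays at 1 for any positive z:

```python
import numpy as np

def d_sigmoid(z):
    # derivative of sigmoid: s(z) * (1 - s(z))
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def d_tanh(z):
    # derivative of tanh: 1 - tanh(z)^2
    return 1.0 - np.tanh(z) ** 2

def d_relu(z):
    # derivative of ReLU: 1 for z > 0, else 0
    return (z > 0).astype(float)

z = np.array([0.5, 2.0, 5.0, 10.0])
print("z          :", z)
print("sigmoid'(z):", np.round(d_sigmoid(z), 5))
print("tanh'(z)   :", np.round(d_tanh(z), 5))
print("relu'(z)   :", d_relu(z))
```

Since backpropagation multiplies these local derivatives layer by layer, repeatedly multiplying by values close to 0 is what makes the gradient vanish in deep networks.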

see also this thread.

Hope that helps!

Best regards
Christian