Weight Initialization for Deep Networks

When we set Var(W) to 2/n, the weights get smaller as n increases, which keeps the algorithm from exploding (but not from vanishing!). So what is the strategy to prevent vanishing gradients? It seems that vanishing is not as big a deal as exploding. Is that true?
Thanks in advance

The point of keeping the magnitudes of the initial weights small is that you avoid saturating activation functions like tanh and sigmoid, whose curves flatten out away from the origin. The flattening causes vanishing gradients or, in the worst case, NaN values for the cost if you get unlucky enough that sigmoid rounds to exactly 1. From a pure math point of view, the output is never exactly 0 or 1, but in the limited world of floating point numbers you don’t have to go very far in order to round to 1 on the positive side. With sigmoid computed as 1 / (1 + exp(-z)), you get sigmoid(z) == 1 for z above roughly 37 in 64-bit floats, and above only about 17 in 32-bit floats.
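To see the rounding in action, here is a minimal NumPy sketch. It assumes sigmoid is computed naively as 1 / (1 + exp(-z)) and that the cost is the unclipped binary cross-entropy; the function names and the test values of z are only for illustration, and the exact threshold depends on the floating point precision.

```python
import numpy as np

def sigmoid(z):
    # Naive formula; stays in the dtype of the input array.
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(a, y):
    # Unclipped cost, elementwise: -(y log a + (1 - y) log(1 - a)).
    return -(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

z64 = np.array([5.0, 20.0, 36.0, 37.0], dtype=np.float64)
z32 = np.array([5.0, 16.0, 17.0], dtype=np.float32)

a64 = sigmoid(z64)
a32 = sigmoid(z32)

saturated64 = a64 == 1.0          # True where sigmoid rounded to exactly 1
saturated32 = a32 == 1.0
local_grad64 = a64 * (1.0 - a64)  # sigmoid'(z); exactly 0 once a == 1

# With label y = 1 the cost should be ~0, but 0 * log(0) evaluates to NaN.
with np.errstate(divide="ignore", invalid="ignore"):
    cost64 = binary_cross_entropy(a64, np.ones_like(a64))

for z, sat, g, c in zip(z64, saturated64, local_grad64, cost64):
    print(f"float64 z={z:5.1f}  a==1: {sat}  sigmoid'(z)={g:.3g}  cost={c}")
for z, sat in zip(z32, saturated32):
    print(f"float32 z={z:5.1f}  a==1: {sat}")
```

Once the activation is exactly 1, the local gradient a * (1 - a) is exactly 0 and the unclipped cost turns into NaN, even though the prediction is "right".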

Thank you, Paulinpaloalto, for your explanation. What if we use ReLU as the activation function for the hidden layers? In that case it seems that setting Var(W) to 2/n only prevents the gradients from exploding.

Well, with ReLU the gradient is either 0 or 1, right? So you’re more likely to have vanishing gradients and “dead neurons” in the hidden layers. But remember that you’ve still got the output layer to worry about: that will be either sigmoid (for binary classifiers) or softmax (for the multi-class case), and vanishing gradients are still a concern there. That’s why it helps to start with relatively small weight values at all layers: if the W values are larger, then the ReLU outputs that don’t get zapped to zero can be large. Large inputs to sigmoid push it into saturation, where the curve flattens out and the gradients vanish.
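A rough way to check how large the ReLU outputs get is to push random data through a few hidden layers and measure the activations that would feed the output layer. The sketch below is only an illustration under assumed settings: the width, depth, batch size, zero biases, and the helper name rms_after_relu_stack are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth, m = 256, 5, 1000  # hypothetical layer width, hidden layers, examples

def rms_after_relu_stack(weight_std):
    """Root-mean-square of the last hidden activation for a given weight std."""
    A = rng.standard_normal((n, m))  # inputs with unit variance
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * weight_std
        A = np.maximum(W @ A, 0.0)   # ReLU, biases left at zero
    return np.sqrt(np.mean(A ** 2))

rms_he = rms_after_relu_stack(np.sqrt(2.0 / n))  # Var(W) = 2/n
rms_unit = rms_after_relu_stack(1.0)             # Var(W) = 1

print(f"Var(W)=2/n: RMS activation = {rms_he:.3g}")
print(f"Var(W)=1:   RMS activation = {rms_unit:.3g}")
```

With Var(W) = 2/n the activations stay around order 1, while with Var(W) = 1 they grow by orders of magnitude in just a few layers, which is exactly the kind of input that saturates a sigmoid output.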

You’re right that the gradient of ReLU with respect to Z is 0 or 1, but in backprop we have dZ[l] = W[l+1]^T dZ[l+1] * g'(Z[l]), where l is the layer index, * is the elementwise product, and g'(Z[l]) = dA[l]/dZ[l] is the derivative of the activation. So the value of W does affect the gradient, and large values of W may cause the exploding gradients problem, which is why we set Var(W) = 2/n. As you mentioned, when the gradient is 0 we have the vanishing problem, but what is the strategy to prevent this? I mean, He or Xavier initialization does not seem to help with the vanishing problem in the hidden layers when we use ReLU activations.
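The effect of W in that recursion can be checked numerically. The following is a sketch under assumptions, not a full training setup: a bias-free ReLU stack with made-up width, depth, and batch size, an upstream gradient of all ones at the last layer, and a hypothetical "gain" factor that rescales weights drawn with Var(W) = 2/n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth, m = 256, 20, 64  # hypothetical width, number of layers, batch size

# Base weights with Var(W) = 2/n; every layer is later multiplied by `gain`.
base_W = [rng.standard_normal((n, n)) * np.sqrt(2.0 / n) for _ in range(depth)]
X = rng.standard_normal((n, m))

def first_layer_grad_rms(gain):
    W = [gain * w for w in base_W]
    # Forward pass, caching the pre-activations Z of every layer.
    A, Zs = X, []
    for w in W:
        Z = w @ A
        Zs.append(Z)
        A = np.maximum(Z, 0.0)
    # Last layer: assumed upstream dA of all ones, times the ReLU mask.
    dZ = np.ones_like(Zs[-1]) * (Zs[-1] > 0)
    # dZ[l] = W[l+1]^T dZ[l+1] * g'(Z[l]), walking back to the first layer.
    for l in range(depth - 2, -1, -1):
        dZ = (W[l + 1].T @ dZ) * (Zs[l] > 0)
    return np.sqrt(np.mean(dZ ** 2))

grad_rms = {g: first_layer_grad_rms(g) for g in (0.5, 1.0, 2.0)}
for g, v in grad_rms.items():
    print(f"gain={g}: RMS of dZ at first layer = {v:.3g}")
```

Because ReLU is unaffected by a positive rescaling of its input pattern, the masks are the same for every gain, so the first-layer gradient simply picks up one factor of the gain per weight matrix in the recursion: too large and it explodes, too small and it vanishes.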

If you have problems with too many dead neurons when you use ReLU in the hidden layers, then the next thing to try is Leaky ReLU instead. It’s almost as cheap to compute and does not zero the negative inputs.
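For reference, a minimal NumPy sketch of the two activations and their gradients. The slope alpha = 0.01 is an assumed, commonly used value rather than something fixed by the discussion above, and the function names are just illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    # 1 for positive inputs, 0 otherwise: negative inputs pass no gradient.
    return np.where(z > 0, 1.0, 0.0)

def leaky_relu(z, alpha=0.01):
    # Negative inputs are scaled by alpha instead of being zeroed.
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    # The gradient is alpha (not 0) for non-positive inputs.
    return np.where(z > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print("relu:            ", relu(z))
print("relu grad:       ", relu_grad(z))
print("leaky relu:      ", leaky_relu(z))
print("leaky relu grad: ", leaky_relu_grad(z))
```

Since the gradient on the negative side is small but nonzero, a unit whose inputs are all negative can still receive updates and recover.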


Yeah, that’s a good idea and it sounds helpful. Thanks for your explanation.