Weight Initialization for Deep Networks

When we set var(w) to 2/n, it reduces the values of W as n increases, so it prevents the gradients from exploding (and not from vanishing!). So what is the strategy to prevent the gradients from vanishing? It seems that vanishing is not as big of a deal as exploding. Is that true?
Thanks in advance

The point of keeping the magnitudes of the initial weight values small is to avoid saturating activation functions like tanh and sigmoid, whose curves flatten out away from the origin. The flattening causes vanishing gradients, or in the worst case NaN values for the cost if you get unlucky enough that sigmoid rounds to exactly 1. From a pure math point of view, the output is never exactly 0 or 1, but in the limited world of floating point numbers you don’t have to go very far before the output rounds to 1 on the positive side. I think sigmoid(z) == 1 for z > 35 or so.
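Not part of the original reply, but here is a quick numerical sketch of that floating point point, assuming float64 NumPy (the exact cutoff depends on the precision used):

```python
import numpy as np

# Find roughly where float64 sigmoid rounds to exactly 1, and what that
# does to the cross-entropy cost.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [10.0, 20.0, 30.0, 37.0]:
    a = sigmoid(z)
    print(f"z = {z:5.1f}  sigmoid(z) = {a!r}  == 1? {a == 1.0}")

# If a == 1 exactly and the true label is 0, the cross-entropy term
# log(1 - a) becomes log(0), which blows up the cost to inf/NaN.
a = sigmoid(40.0)
y = 0
loss = -(y * np.log(a) + (1 - y) * np.log(1 - a))
print("loss when sigmoid saturates to 1 and y = 0:", loss)
```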


Thank you Paulinpaloalto for your explanation. What if we use ReLU as the activation function for the hidden layers? In the case of ReLU, it seems that setting var(w) to 2/n only prevents the gradients from exploding.

Well, with ReLU the gradient is either 0 or 1, right? So you’re more likely to have vanishing gradients and “dead neurons”, at least in the hidden layers. But remember that you’ve still got the output layer to worry about, which will be either sigmoid (for binary classifiers) or softmax (for the multi-class case), and you still have to worry about vanishing gradients there. That’s why it helps to start with relatively small weight values at all layers: if the W values are larger, then the ReLU outputs that don’t get zapped to zero can be large, and large inputs to sigmoid can trigger saturation and vanishing gradients because of the way the curve flattens out for large absolute values of the inputs.
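To make that concrete, here is a toy sketch of my own (not code from the course): one ReLU hidden layer feeding a sigmoid output unit, comparing He-style initialization (variance 2/n) against weights that are 10x larger. The layer sizes and scale factor are just illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, m = 100, 50, 1000          # input size, hidden size, batch size
X = rng.standard_normal((n_x, m))

def mean_sigmoid_input(scale_factor):
    # He-style init, optionally blown up by scale_factor
    W1 = rng.standard_normal((n_h, n_x)) * np.sqrt(2.0 / n_x) * scale_factor
    W2 = rng.standard_normal((1, n_h)) * np.sqrt(2.0 / n_h) * scale_factor
    A1 = np.maximum(0, W1 @ X)       # ReLU hidden layer
    Z2 = W2 @ A1                     # input to the sigmoid output unit
    return np.abs(Z2).mean()

print("mean |Z2| with He init:        ", mean_sigmoid_input(1.0))
print("mean |Z2| with 10x larger init:", mean_sigmoid_input(10.0))
# The bigger the sigmoid's input magnitude, the flatter the curve there,
# so the gradients at the output layer shrink toward zero (saturation).
```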


You’re right that the gradient of ReLU with respect to Z is 0 or 1, but when we are doing backprop we have dZ[l] = W[l+1]^T · dZ[l+1] * (dA[l]/dZ[l]), where l is the layer number. So the value of W does affect the gradient, and large values of W may cause the exploding gradient kind of problem, which is why we set var(w) = 2/n. As you mentioned, when the gradient is 0 we have the vanishing problem, but what is the strategy to prevent this? I mean, He or Xavier initialization does not help with the vanishing problem in the hidden layers when we have ReLU activation functions.
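Here is a small sketch of that backprop step with made-up layer sizes, just to show how the scale of W[l+1] multiplies into dZ[l] on top of the 0/1 ReLU mask:

```python
import numpy as np

rng = np.random.default_rng(1)
n_l, n_lp1, m = 50, 30, 64                   # layer sizes and batch size

Z_l = rng.standard_normal((n_l, m))          # pre-activations at layer l
dZ_lp1 = rng.standard_normal((n_lp1, m))     # gradient coming from layer l+1
relu_grad = (Z_l > 0).astype(float)          # dA[l]/dZ[l] for ReLU: 0 or 1

for scale in [0.1, 1.0, 10.0]:
    W_lp1 = rng.standard_normal((n_lp1, n_l)) * scale
    dZ_l = (W_lp1.T @ dZ_lp1) * relu_grad    # dZ[l] = W[l+1]^T dZ[l+1] * g'(Z[l])
    print(f"weight scale {scale:5.1f} -> mean |dZ[l]| = {np.abs(dZ_l).mean():.3f}")
# Larger weights amplify the gradient (toward exploding); tiny weights or a
# mask full of zeros shrink it (toward vanishing).
```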

If you have problems with too many dead neurons when you use ReLU in the hidden layers, then the next thing to try is Leaky ReLU instead. It’s almost as cheap to compute and does not zero the negative inputs.
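A minimal sketch of what that looks like (the 0.01 slope is a common default, not something specified here):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    # Gradient is 1 for positive inputs and alpha (not 0) for negative ones,
    # so neurons never go completely "dead".
    return np.where(z > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(z))        # [-0.03  -0.005  0.     0.5    3.   ]
print(leaky_relu_grad(z))   # [ 0.01   0.01   0.01   1.     1.   ]
```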


Yeah, that’s a good idea and it sounds helpful. Thanks for your explanation.