In the video about Vanishing/Exploding gradients, the example shown uses a linear activation function g(z) = z. Without any non-linearity, I understand how the activations can explode / get very small in a deep network. But does this hold when there is a non-linear activation function? E.g. if g(z)=tanh(z), the activations would be clamped between -1 and 1, so I think this would prevent activations/gradients from exploding or diminishing.
Thanks for any clarification on this point!
As I think about this more, I suppose that even with a tanh activation, if the value of z gets very large, then although the activations stay bounded, the derivative of tanh approaches zero, so the gradients would still vanish. But I’d love to get other people’s thoughts on this.
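To make that concrete, here's a quick NumPy sketch I put together (the depth and the z values are just illustrative): tanh'(z) = 1 - tanh(z)^2 shrinks rapidly as |z| grows, and since backprop multiplies roughly one such factor per layer, a deep stack of even mildly saturated units already drives the gradient toward zero.

```python
import numpy as np

def tanh_grad(z):
    # derivative of tanh: 1 - tanh(z)^2, which heads toward 0 as |z| grows
    return 1.0 - np.tanh(z) ** 2

for z in [0.5, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}  ->  tanh'(z) = {tanh_grad(z):.6f}")

# backprop multiplies roughly one such factor per layer, so even mild
# saturation shrinks the gradient exponentially with depth
per_layer = tanh_grad(2.0)   # ~0.07 for a mildly saturated unit
print("product over 50 layers:", per_layer ** 50)
```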
Hi @spather
welcome to the community!
Yes, your statement in the 2nd post is true: vanishing gradients can also occur with non-linear activation functions.
Let’s take sigmoid or tanh. Both carry this risk because they saturate (flatten out) at the tails, which drives their derivatives toward zero and makes the gradient “vanish”.
- e.g. for ReLU there is a reduced risk of vanishing gradients, since its gradient is constant (equal to 1) over the entire positive range. It does not saturate, in contrast to sigmoid or tanh; see the quick comparison below and also: Activation functions
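To illustrate the difference (just a small NumPy sketch of my own, not from the course material): sigmoid'(z) peaks at 0.25 and tanh'(z) at 1.0, and both collapse toward 0 for large |z|, while the ReLU derivative stays exactly 1 everywhere on the positive side.

```python
import numpy as np

def d_sigmoid(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)            # at most 0.25, tends to 0 in the tails

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2    # at most 1.0, tends to 0 in the tails

def d_relu(z):
    return 1.0 if z > 0 else 0.0    # constant 1 on the whole positive range

for z in [0.5, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}  sigmoid' = {d_sigmoid(z):.6f}  "
          f"tanh' = {d_tanh(z):.6f}  ReLU' = {d_relu(z):.1f}")
```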
In addition: exploding gradients can also occur with non-linear activation functions. This can be caused by badly chosen hyperparameters, e.g. a weight initialization or learning rate that is too large; see also the links below.
Here you can find some mitigation strategies, like gradient clipping and other approaches:
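As a rough idea of what gradient clipping does (a minimal NumPy sketch of my own; frameworks ship built-in versions, e.g. torch.nn.utils.clip_grad_norm_ in PyTorch), the gradients are rescaled whenever their combined norm exceeds a chosen threshold:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm
    does not exceed max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-12)
        grads = [g * scale for g in grads]
    return grads

# toy example: an exploding gradient of norm ~500 gets scaled back to norm 5
grads = [np.array([300.0, 400.0]), np.array([0.1, -0.2])]
print([g.round(4) for g in clip_by_global_norm(grads, max_norm=5.0)])
```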
Please let me know if anything is unclear, @spather, and don’t hesitate to ask.
Best regards
Christian
Thanks, @Christian_Simonis, for the quick reply - super helpful. My takeaway is that with non-linear activation functions like tanh, the activations themselves can’t explode, but the gradients still can if several per-layer factors greater than 1 get multiplied together in a very deep network. Let me know if that’s not right.
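To convince myself, I ran a toy NumPy experiment (the layer width, depth, and weight scale below are arbitrary choices of mine): the tanh activations stay inside [-1, 1] no matter how deep the stack, but each backward step multiplies by W.T @ diag(tanh'(z)), and with large enough weights that factor can have norm greater than 1, so the gradient norm can still grow with depth.

```python
import numpy as np

rng = np.random.default_rng(1)
n, depth = 100, 30                       # arbitrary width and depth

# forward pass: tanh keeps every activation in [-1, 1], however deep we go
a = rng.normal(size=n)
Ws = [rng.normal(scale=4.0 / np.sqrt(n), size=(n, n)) for _ in range(depth)]
zs = []
for W in Ws:
    z = W @ a
    zs.append(z)
    a = np.tanh(z)
print("max |activation| after", depth, "layers:", np.abs(a).max())  # still < 1

# backward pass: each layer contributes a factor W.T @ diag(tanh'(z));
# with this (deliberately large) weight scale the gradient norm blows up
grad = np.ones(n)
print("gradient norm at the output:", np.linalg.norm(grad))
for W, z in zip(reversed(Ws), reversed(zs)):
    grad = W.T @ ((1.0 - np.tanh(z) ** 2) * grad)
print("gradient norm at the input :", np.linalg.norm(grad))
```

With a smaller weight scale the same backward loop makes the gradient shrink instead, which is the vanishing case discussed earlier in the thread.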