Vanishing/Exploding Gradients when there is a non-linear activation function

Hi @spather

Welcome to the community!

Yes, your statement in the 2nd post is correct: vanishing gradients can also occur with non-linear activation functions.
Take sigmoid or tanh: both carry this risk because they saturate (flatten out) toward the tails, which makes the gradient “vanish”.

  • e.g. for ReLU the risk of vanishing gradients is reduced, since the gradient in the positive section of the ReLU function is constant. It does not saturate, in contrast to sigmoid or tanh, see also: Activation functions
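To make the saturation point concrete, here is a small sketch (my own illustration, not from the original thread) comparing the derivatives of sigmoid, tanh and ReLU for inputs of increasing magnitude:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative of sigmoid: s(x) * (1 - s(x)), at most 0.25
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    # derivative of tanh: 1 - tanh(x)^2, at most 1.0
    return 1.0 - np.tanh(x) ** 2

def relu_grad(x):
    # derivative of ReLU: 1 in the positive region, 0 otherwise
    return np.where(x > 0, 1.0, 0.0)

for x in [0.0, 5.0, 10.0]:
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.2e}  "
          f"tanh'={tanh_grad(x):.2e}  relu'={relu_grad(x):.0f}")
```

For x = 10 the sigmoid and tanh gradients are already tiny (on the order of 1e-5 and 1e-9), while the ReLU gradient stays exactly 1 — that constant gradient is what reduces the vanishing-gradient risk in the positive section.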

In addition, exploding gradients can also occur with non-linear activation functions. This can be influenced by badly chosen hyperparameters, see also the links below.
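Here is a hedged sketch of that effect (the numbers are illustrative, not from the thread): in backprop through a deep network, the gradient is a product of per-layer terms, so an overly large weight initialization (one example of a badly chosen hyperparameter) can make the gradient norm grow exponentially with depth, while a too-small one makes it vanish:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 50, 64

def grad_norm(weight_scale):
    """Backprop a gradient through `depth` random linear layers with ReLU-like masks."""
    grad = np.ones(width)  # upstream gradient at the output
    for _ in range(depth):
        W = rng.normal(0.0, weight_scale, size=(width, width))
        grad = W.T @ grad                        # linear backprop step
        grad = grad * (rng.random(width) < 0.5)  # ReLU mask: ~half the units active
    return np.linalg.norm(grad)

print("small init:", grad_norm(0.05))  # shrinks layer by layer -> vanishes
print("large init:", grad_norm(0.50))  # grows layer by layer -> explodes
```

Only the weight scale changes between the two runs; the same chain of layers produces either a vanishing or an exploding gradient.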

Here you can find some mitigation strategies, such as gradient clipping, among other approaches:
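As a minimal sketch of the gradient-clipping idea (the same concept as PyTorch's `torch.nn.utils.clip_grad_norm_`; the function name below is my own):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    # scale is 1.0 when the norm is already small enough, < 1.0 otherwise
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]

# global norm of these gradients is sqrt(9 + 16 + 144) = 13
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped = clip_by_global_norm(grads, max_norm=1.0)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))  # clipped norm, close to 1.0
```

The key design choice is to scale all gradients by the same factor, so their directions are preserved and only the overall magnitude is capped.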

Please let me know if anything is unclear, @spather, and don’t hesitate to ask.

Best regards