Hi @spather,
welcome to the community!
Yes, your statement in the 2nd post is true: vanishing gradients can also occur with non-linear activation functions.
Take sigmoid or tanh as examples: they carry this risk because they saturate (flatten out) towards the tails, which makes the gradient “vanish”.
- ReLU, in contrast, carries a reduced risk of vanishing gradients since its gradient is constant in the positive section: it does not saturate, unlike sigmoid or tanh (see the small numeric sketch below). See also: Activation functions
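
To make the saturation point concrete, here is a minimal NumPy sketch (my own illustration, not part of the course material): it evaluates the derivatives of sigmoid, tanh and ReLU at a few inputs and shows how the first two shrink towards zero for large |x| while ReLU stays constant on the positive side.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # max 0.25 at x=0, ~0 for large |x|

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2  # max 1 at x=0, ~0 for large |x|

def d_relu(x):
    return (x > 0).astype(float)  # constant 1 for x > 0, no saturation

x = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
print("sigmoid'(x):", np.round(d_sigmoid(x), 6))  # tiny in the tails -> vanishing gradient risk
print("tanh'(x):   ", np.round(d_tanh(x), 6))     # tiny in the tails -> vanishing gradient risk
print("relu'(x):   ", d_relu(x))                  # stays 1 on the positive side
```

Multiplying many such near-zero derivatives across layers during backpropagation is exactly what makes the overall gradient vanish.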
In addition, exploding gradients can also occur with non-linear activation functions. This can be driven by poorly chosen hyperparameters; see also the links below.
Here you can find some mitigation strategies, such as gradient clipping (a short clipping sketch follows after the links) and others:
- Vanishing/Exploding gradients C2W1 - #2 by Christian_Simonis
- https://machinelearningmastery.com/exploding-gradients-in-neural-networks/
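
For gradient clipping specifically, here is a minimal sketch assuming PyTorch (the model, the dummy batch and `max_norm=1.0` are just placeholders I chose for illustration, not something from your notebook):

```python
import torch
import torch.nn as nn

# Tiny placeholder model and dummy batch, purely for illustration
model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x = torch.randn(64, 10)
y = torch.randn(64, 1)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()

# Rescale all gradients so their global norm does not exceed max_norm,
# which limits the damage a single exploding gradient step can do
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

The same idea exists in other frameworks as well (e.g. clipnorm / clipvalue arguments of Keras optimizers); the key point is to bound the gradient magnitude before the parameter update.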
Please let me know if anything is unclear, @spather, and don’t hesitate to ask.
Best regards
Christian