# Vanishing/Exploding Gradients when there is a non-linear activation function

In the video about Vanishing/Exploding gradients, the example shown uses a linear activation function g(z) = z. Without any non-linearity, I understand how the activations can explode or get very small in a deep network. But does this still hold when there is a non-linear activation function? E.g. if g(z) = tanh(z), the activations would be clamped between -1 and 1, so I think this would prevent the activations/gradients from exploding or diminishing.

Thanks for any clarification on this point!

As I think about this more, I suppose that even with a tanh activation, if the value of z gets very large, then even though the activations stay bounded, the gradients would approach zero. But I’d love to get other people’s thoughts on this.

Hi @spather

welcome to the community!

Yes, your statement in the 2nd post is true: gradients can vanish even with a non-linear activation function.
Take sigmoid or tanh, for example. Both carry this risk because they saturate (flatten out) at the tails, which makes the gradient “vanish”.
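To make the saturation concrete, here is a minimal sketch (the helper name `tanh_grad` is just for illustration) that evaluates the analytic derivative of tanh, 1 - tanh(z)^2, for growing |z|:

```python
import math

def tanh_grad(z):
    # Derivative of tanh: tanh'(z) = 1 - tanh(z)^2 (illustrative helper)
    return 1.0 - math.tanh(z) ** 2

# As |z| grows, tanh saturates near +/-1 and its gradient collapses toward 0.
for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}  tanh'(z) = {tanh_grad(z):.2e}")
```

At z = 0 the gradient is 1, but it drops by several orders of magnitude once z is only moderately large, which is exactly the “vanishing” effect in the saturated regions.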

• ReLU, by contrast, carries a reduced risk of vanishing gradients, since its gradient is constant in the positive region of the function. It does not saturate, in contrast to sigmoid or tanh, see also: Activation functions
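A quick side-by-side sketch of this contrast (the helper names `relu_grad` and `tanh_grad` are hypothetical, just for illustration):

```python
import math

def relu_grad(z):
    # Derivative of ReLU: 1 in the positive region, 0 for z < 0
    return 1.0 if z > 0 else 0.0

def tanh_grad(z):
    # Derivative of tanh: tanh'(z) = 1 - tanh(z)^2
    return 1.0 - math.tanh(z) ** 2

# For a large positive pre-activation, ReLU's gradient stays at 1,
# while tanh's gradient has effectively vanished.
print(relu_grad(10.0))  # 1.0
print(tanh_grad(10.0))  # on the order of 1e-8
```

This is one reason ReLU became the default choice for hidden layers in deep networks, though it has its own failure mode (“dead” units, where z stays negative and the gradient is exactly 0).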