In the video about Vanishing/Exploding gradients, the example shown uses a linear activation function g(z) = z. Without any non-linearity, I understand how the activations can explode / get very small in a deep network. But does this hold when there is a non-linear activation function? E.g. if g(z)=tanh(z), the activations would be clamped between -1 and 1, so I think this would prevent activations/gradients from exploding or diminishing.
Thanks for any clarification on this point!
As I think about this more, I suppose that even with a tanh activation, if the value of z gets very large, then although the activations stay bounded, the derivative of tanh approaches zero, so the gradients would still vanish. But I’d love to get other people’s thoughts on this.
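To make that concrete, here's a quick NumPy sketch I put together (the depth and the z values are just illustrative): tanh'(z) = 1 - tanh(z)^2 shrinks rapidly as |z| grows, and since backprop multiplies roughly one such factor per layer, a deep stack of even mildly saturated units already drives the gradient toward zero.

```python
import numpy as np

def tanh_grad(z):
    # derivative of tanh: 1 - tanh(z)^2, which heads toward 0 as |z| grows
    return 1.0 - np.tanh(z) ** 2

for z in [0.5, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}  ->  tanh'(z) = {tanh_grad(z):.6f}")

# backprop multiplies roughly one such factor per layer, so even mild
# saturation shrinks the gradient exponentially with depth
per_layer = tanh_grad(2.0)   # ~0.07 for a mildly saturated unit
print("product over 50 layers:", per_layer ** 50)
```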
Hi @spather
welcome to the community!
Yes, your statement in the 2nd post is true: vanishing gradients can also occur with non-linear activation functions.
Let’s take sigmoid or tanh. Both carry this risk because they saturate (flatten out) at the tails, which drives their derivatives toward zero and makes the gradient “vanish”.
- e.g. for ReLU there is a reduced risk of vanishing gradients, since its gradient is constant (equal to 1) over the entire positive range. It does not saturate, in contrast to sigmoid or tanh; see the quick comparison below and also: Activation functions
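To illustrate the difference (just a small NumPy sketch of my own, not from the course material): sigmoid'(z) peaks at 0.25 and tanh'(z) at 1.0, and both collapse toward 0 for large |z|, while the ReLU derivative stays exactly 1 everywhere on the positive side.

```python
import numpy as np

def d_sigmoid(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)            # at most 0.25, tends to 0 in the tails

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2    # at most 1.0, tends to 0 in the tails

def d_relu(z):
    return 1.0 if z > 0 else 0.0    # constant 1 on the whole positive range

for z in [0.5, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}  sigmoid' = {d_sigmoid(z):.6f}  "
          f"tanh' = {d_tanh(z):.6f}  ReLU' = {d_relu(z):.1f}")
```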
In addition: exploding gradients can also occur with non-linear activation functions. This can be caused by badly chosen hyperparameters, e.g. a weight initialization or learning rate that is too large; see also the links below.
Here you can find some mitigation strategies, like gradient clipping and other approaches:
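As a rough idea of what gradient clipping does (a minimal NumPy sketch of my own; frameworks ship built-in versions, e.g. torch.nn.utils.clip_grad_norm_ in PyTorch), the gradients are rescaled whenever their combined norm exceeds a chosen threshold:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm
    does not exceed max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-12)
        grads = [g * scale for g in grads]
    return grads

# toy example: an exploding gradient of norm ~500 gets scaled back to norm 5
grads = [np.array([300.0, 400.0]), np.array([0.1, -0.2])]
print([g.round(4) for g in clip_by_global_norm(grads, max_norm=5.0)])
```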
Please let me know if anything is unclear, @spather, and don’t hesitate to ask.
Best regards
Christian
Thanks, @Christian_Simonis, for the quick reply - super helpful. My takeaway is that with non-linear activation functions like tanh, the activations themselves can’t explode, but the gradients still can if several per-layer factors greater than 1 get multiplied together in a very deep network. Let me know if that’s not right.
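To convince myself, I ran a toy NumPy experiment (the layer width, depth, and weight scale below are arbitrary choices of mine): the tanh activations stay inside [-1, 1] no matter how deep the stack, but each backward step multiplies by W.T @ diag(tanh'(z)), and with large enough weights that factor can have norm greater than 1, so the gradient norm can still grow with depth.

```python
import numpy as np

rng = np.random.default_rng(1)
n, depth = 100, 30                       # arbitrary width and depth

# forward pass: tanh keeps every activation in [-1, 1], however deep we go
a = rng.normal(size=n)
Ws = [rng.normal(scale=4.0 / np.sqrt(n), size=(n, n)) for _ in range(depth)]
zs = []
for W in Ws:
    z = W @ a
    zs.append(z)
    a = np.tanh(z)
print("max |activation| after", depth, "layers:", np.abs(a).max())  # still < 1

# backward pass: each layer contributes a factor W.T @ diag(tanh'(z));
# with this (deliberately large) weight scale the gradient norm blows up
grad = np.ones(n)
print("gradient norm at the output:", np.linalg.norm(grad))
for W, z in zip(reversed(Ws), reversed(zs)):
    grad = W.T @ ((1.0 - np.tanh(z) ** 2) * grad)
print("gradient norm at the input :", np.linalg.norm(grad))
```

With a smaller weight scale the same backward loop makes the gradient shrink instead, which is the vanishing case discussed earlier in the thread.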