Since both sigmoid and tanh saturate at both extremes, do networks that use sigmoid (or tanh) suffer only from vanishing gradients and not from the exploding gradient problem?
My reasoning: when the sigmoid or tanh's input Z (the weighted sum of the previous layer's activations) is either extremely small or extremely large, the derivative of the sigmoid is always close to zero. This means the gradient at the output layer when using a sigmoid activation should always be close to zero, whether the "activations" are exploding or vanishing, and so the sigmoid activation should always suffer from vanishing gradients rather than exploding gradients?
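To make concrete what I mean by the derivative shrinking at the extremes, here is a quick NumPy check (the sample values of z are arbitrary, just to show the trend):

```python
# Quick check: sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) peaks at 0.25
# when z = 0 and shrinks toward zero as |z| grows.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:4.1f}   sigmoid'(z) = {sigmoid_grad(z):.2e}")
# z =  0.0   sigmoid'(z) = 2.50e-01
# z = 10.0   sigmoid'(z) = 4.54e-05
```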
Can someone help me understand how the exploding gradient problem occurs when using the sigmoid activation function?
Kindly correct me if my comprehension of vanishing gradients is totally wrong.
Well, although the sigmoid function itself does not inherently cause the exploding gradient problem, the problem can still occur in practice, especially in deep networks.
The exploding gradient problem arises when the gradients become extremely large during training. While the sigmoid function doesn’t naturally cause this, other factors in the network can contribute to it. For example:
Initialization: If the weights are initialized too large, or the initialization scheme is poorly chosen, the gradients can grow exponentially during backpropagation. Even though the sigmoid's derivative is at most 0.25, a layer can still amplify the gradient when its weights are large enough to outweigh that factor (see the sketch after this list).
Poorly designed architectures: Deep networks with a large number of layers multiply many per-layer factors together, so any per-layer amplification compounds with depth and makes the network more susceptible to exploding gradients.
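To see how large weights can overpower the sigmoid's small derivative, here is a deliberately contrived NumPy sketch (the scalar chain, the weight values, and the bias trick are all made up for illustration): the bias keeps every pre-activation at 0, so sigmoid'(z) sits at its maximum of 0.25, and the backprop factor per layer becomes 0.25 * w, which compounds into an exploding gradient as soon as |w| > 4.

```python
# Contrived sketch: a chain of scalar "layers" a_l = sigmoid(w * a_{l-1} - w/2)
# with input a_0 = 0.5. The bias -w/2 keeps every pre-activation at 0, so
# sigmoid'(z) stays at its maximum of 0.25 and the backprop factor per layer
# is exactly 0.25 * w.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gradient(w, depth):
    """Gradient of the final activation w.r.t. the input a_0 = 0.5."""
    a, grad = 0.5, 1.0
    for _ in range(depth):
        z = w * a - w / 2.0          # stays at 0 in this toy setup
        a = sigmoid(z)               # stays at 0.5
        grad *= w * a * (1.0 - a)    # chain rule: w * sigmoid'(z)
    return grad

print(input_gradient(w=2.0, depth=30))   # ~9.3e-10 -> vanishing
print(input_gradient(w=6.0, depth=30))   # ~1.9e+05 -> exploding
```

In real sigmoid networks the pre-activations usually drift into the saturated region, where the derivative collapses, so vanishing gradients are by far the more common failure mode; the sketch only shows that the bounded derivative by itself doesn't rule explosion out when the weights are large.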
While the sigmoid function itself doesn’t push gradients to explode, in practice, it can still happen due to these other factors.
Another cause of exploding gradients: When the input features span a wide range of values and you're using a fixed learning rate, it can be difficult to find a learning rate that is large enough to converge in a reasonable amount of time but not so large that the gradients and weight updates blow up.
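As a hedged illustration of that scaling effect (the single sigmoid unit, the feature ranges, and the labels below are all invented): the gradient w.r.t. each weight scales with the corresponding input feature, so a feature in the thousands produces a gradient orders of magnitude larger than a feature in [0, 1], and no single fixed learning rate suits both.

```python
# Made-up example: two input features on very different scales feeding one
# sigmoid unit. The gradient w.r.t. each weight scales with its input
# feature, so the raw features give wildly mismatched gradient magnitudes;
# standardizing the features makes them comparable.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weight_gradients(X, y, w):
    """Gradient of a squared-error loss for y_hat = sigmoid(X @ w)
    (constant factors dropped)."""
    y_hat = sigmoid(X @ w)
    delta = (y_hat - y) * y_hat * (1.0 - y_hat)   # chain rule through sigmoid
    return X.T @ delta / len(y)

n = 256
x1 = rng.uniform(0.0, 1.0, n)         # small-scale feature
x2 = rng.uniform(0.0, 5000.0, n)      # large-scale feature
X_raw = np.column_stack([x1, x2])
y = (x1 + x2 / 5000.0 > 1.0).astype(float)
w = np.zeros(2)

print("raw features :", weight_gradients(X_raw, y, w))
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)   # standardize
print("standardized :", weight_gradients(X_std, y, w))
# The raw-feature gradient for the second weight is orders of magnitude
# larger than for the first; after standardization they are comparable.
```

This is why standardizing the inputs (and, when necessary, clipping the gradient norm) is usually the first thing to try against this kind of blow-up.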