Having a large weight matrix will lead to vanishing or exploding gradients

I remember Andrew mentioned that if the weight matrix is greater than the identity matrix (in a general sense), the model will have exploding gradients: the activations will increase drastically layer by layer, and something similar will happen to the gradients as well.
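
Here is a minimal numpy sketch (my own illustration, not from the course) of that layer-by-layer growth, assuming a deep stack of identical layers with a linear activation and a weight matrix that is a scaled-up identity:

```python
import numpy as np

np.random.seed(0)
n, L = 4, 50                 # layer width, number of layers
W = 1.5 * np.eye(n)          # weight matrix "greater than" the identity
a = np.random.randn(n)       # input activations

for _ in range(L):
    a = W @ a                # linear activation, so a grows like 1.5**L

print(np.abs(a).max())       # blows up to roughly 1e8-1e9 times the input scale
```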

But with a sigmoid activation function, a larger W gives a larger Z, which leads to very small gradients, because the slope of the sigmoid at large Z is very low (almost parallel to the x-axis).
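
A quick numeric check of that slope (my own sketch): the local gradient of the sigmoid is $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$, and it shrinks rapidly as $|z|$ grows.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# sigmoid'(z) = s * (1 - s); it decays roughly like exp(-|z|) for large |z|
for z in [0.0, 2.0, 5.0, 10.0, 20.0]:
    s = sigmoid(z)
    print(f"z={z:5.1f}  sigmoid'(z)={s * (1 - s):.2e}")
# z= 0.0  sigmoid'(z)=2.50e-01
# z= 2.0  sigmoid'(z)=1.05e-01
# z= 5.0  sigmoid'(z)=6.65e-03
# z=10.0  sigmoid'(z)=4.54e-05
# z=20.0  sigmoid'(z)=2.06e-09
```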

These two ideas seem to contradict each other.

Here are several thoughts:

  • The issue is not with the magnitude of the weights, it’s the magnitude of the gradients.
  • The magnitude of the gradients is controlled largely by the magnitude of the features.
  • sigmoid() is not the only activation function.
  • Because sigmoid has very small gradients for large values of ‘z’, its problem is vanishing gradients, not exploding ones (see the sketch after this list).
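
To make the last point concrete, here is a small numpy sketch (my own illustration, under the assumption of a deep fully-connected stack with deliberately large random weights and sigmoid activations): the forward activations stay bounded in (0, 1) because the units saturate, while the backpropagated gradient is multiplied by a tiny $\sigma'(z)$ at every layer and vanishes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)
n, L = 4, 20
W = 5.0 * np.random.randn(n, n)          # deliberately large weights
x = np.random.randn(n)

# Forward pass, caching pre-activations z for backprop.
a, zs = x, []
for _ in range(L):
    z = W @ a
    zs.append(z)
    a = sigmoid(z)                       # saturates near 0 or 1 when |z| is large

# Backward pass: each layer multiplies the gradient by sigmoid'(z),
# which is tiny for large |z|, so the signal shrinks toward 0 with depth.
g = np.ones(n)
for z in reversed(zs):
    s = sigmoid(z)
    g = W.T @ (g * s * (1.0 - s))

print("max |activation|:", np.abs(a).max())   # bounded by 1 (saturated units)
print("max |gradient|  :", np.abs(g).max())   # vanishingly small
```

So the large weights do not make the sigmoid network's gradients explode; they push the units into saturation, and the gradients vanish instead.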