As far as I know, when the weights are much smaller or much larger than one, the ACTIVATIONS either explode or vanish at some point during the forward pass. So why don’t we call it ‘vanishing/exploding activations’ instead? When these activations explode or vanish, how does that affect backprop? Can we even perform backprop? And are there situations where the activations don’t explode, but the explosion/vanishing happens only during backprop?
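For concreteness, here is a small NumPy sketch of what I mean (the depth, width, and weight scales are made-up numbers, and I use plain linear layers just to isolate the effect): when the weights are drawn with a scale below 1/√n the forward signal shrinks toward zero, and above it the signal blows up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=(1, n))

for scale in (0.5, 1.0, 2.0):           # multipliers on a 1/sqrt(n)-scaled weight matrix (made-up values)
    h = x
    for _ in range(50):                  # 50 purely linear layers, no nonlinearity
        W = scale * rng.normal(size=(n, n)) / np.sqrt(n)
        h = h @ W
    print(scale, np.abs(h).mean())       # roughly 1e-16, O(1), and 1e+15 respectively
```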
The problem is not that the activation values explode or vanish. It is a problem when the gradients of the activation functions explode or vanish, because that causes trouble for the backpropagation process: if the gradients vanish, then you can no longer make progress (learn). The gradients are what drive the weight updates, and if the gradients are close to zero, then there is very little change. If the gradients explode, then the updates overshoot and you can’t even converge to a solution.
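To make that concrete, here is a minimal sketch of a single gradient-descent update in each regime (the learning rate and gradient magnitudes are invented purely for illustration):

```python
lr = 0.1
w = 1.0

# Vanishing regime: a near-zero gradient barely moves the weight, so no learning happens.
tiny_grad = 1e-8
print(w - lr * tiny_grad)   # ~1.0, essentially unchanged

# Exploding regime: a huge gradient makes the update overshoot wildly, so training diverges.
huge_grad = 1e8
print(w - lr * huge_grad)   # -9999999.0, thrown far away from any minimum
```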
Of course the gradient of any given activation function is just one factor in the chain-rule calculation for the gradient of a given weight, but the product of many small values is very small and the product of many large values is very large.
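As a back-of-the-envelope illustration (the per-layer factors of 0.25 and 1.5 are purely made up), multiplying 20 such factors together already puts you deep into vanishing or exploding territory:

```python
# 20 chain-rule factors that are each a bit below or a bit above 1 (illustrative values only)
print(0.25 ** 20)   # ~9.1e-13  -> the gradient has effectively vanished
print(1.5 ** 20)    # ~3325     -> the gradient is exploding
```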
It is also the case that the behavior of the gradient values is tied to the behavior of the underlying activation function: e.g. both sigmoid and tanh flatten out for large values of |z|, so large |z| values cause vanishing gradients with those functions.
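You can see the flattening directly by evaluating the derivatives, σ'(z) = σ(z)(1 − σ(z)) and tanh'(z) = 1 − tanh²(z), at a few values of z:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # peaks at 0.25 when z = 0

def dtanh(z):
    return 1.0 - np.tanh(z) ** 2    # peaks at 1.0 when z = 0

for z in (0.0, 2.0, 5.0, 10.0):
    print(z, dsigmoid(z), dtanh(z))
# At z = 10 the sigmoid slope is ~4.5e-5 and the tanh slope is ~8.2e-9:
# both are effectively flat, so they contribute near-zero factors to the chain rule.
```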
So in the case of the ReLU activation function, we would only have to worry about vanishing gradients for values of z < 0?
Exactly! The behavior depends on the activation function. Of course there’s a relatively simple solution for the ReLU case if you hit this problem: switch to Leaky ReLU.
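If it helps, here is a quick sketch of the difference in the gradients (the 0.01 slope on the negative side is just a commonly used default, not a required value):

```python
import numpy as np

def relu_grad(z):
    return (z > 0).astype(float)           # exactly 0 for z < 0: no gradient flows back

def leaky_relu_grad(z, alpha=0.01):        # alpha = slope on the negative side (common default)
    return np.where(z > 0, 1.0, alpha)     # small but nonzero gradient for z < 0

z = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu_grad(z))         # [0.   0.   1.   1.  ]
print(leaky_relu_grad(z))   # [0.01 0.01 1.   1.  ]
```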