Vanishing and exploding gradients

From my understanding, the reason gradients matter in deep learning is that they give the DIRECTION in which the loss function decreases fastest. If that is the case, then we don’t really care about the magnitude of the gradient, so why are we bothered by vanishing and exploding gradients? Is it because it becomes difficult to extract the direction when the entries are too small or too large (a floating-point precision problem)?

It’s not only the direction but also the magnitude by which the step towards an optimum is made. If the gradients explode, you can overshoot (jump past) the optimum and wander around for a long time without ever finding it. If the gradients vanish, each step makes very little progress, so reaching an optimum can take a very long time and waste a lot of resources for little gain.
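Here is a minimal sketch (my own illustration, not from the thread) of gradient descent on f(w) = w², where a `scale` factor stands in for a gradient that has been amplified or shrunk by many layers; the function and variable names are assumptions.

```python
# Sketch: gradient descent on f(w) = w^2, showing that the gradient's
# magnitude (not just its direction) sets the step size W_new = W_old - lr * grad.

def grad(w, scale):
    # scale mimics a gradient amplified or shrunk by many layers
    # (exploding / vanishing); the true gradient of w^2 is 2w.
    return scale * 2 * w

def descend(lr, scale, steps=5, w=1.0):
    for _ in range(steps):
        w = w - lr * grad(w, scale)
    return w

lr = 0.1
print(descend(lr, scale=1.0))    # healthy gradient: converges toward 0
print(descend(lr, scale=100.0))  # "exploding": steps overshoot and diverge
print(descend(lr, scale=1e-6))   # "vanishing": w barely moves from 1.0
```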

But can we not extract just the direction and use an arbitrary learning rate? It won’t quite reach the optimum, but given a small enough learning rate, it should get within some radius of it, I think?

The new weights after each step of gradient descent are W_new = W_old − (α * dE/dW), so the step is the learning rate multiplied by the gradient: if the gradient vanishes or explodes, the step shrinks or blows up no matter which α you pick.
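For illustration, here is a rough sketch of that update next to the “direction only” idea from the question above, using numpy; the function names and the normalize-by-norm choice are my assumptions, not something stated in the thread.

```python
# Sketch: the standard update from the reply above vs. a "direction only" update.
import numpy as np

def standard_update(w, grad, lr):
    # W_new = W_old - lr * dE/dW : step size scales with the gradient's magnitude
    return w - lr * grad

def normalized_update(w, grad, lr, eps=1e-12):
    # Keep only the gradient's direction; the step length is just lr.
    # This is essentially normalized gradient descent; a softer, more common
    # variant in practice is gradient (norm) clipping.
    return w - lr * grad / (np.linalg.norm(grad) + eps)

w = np.array([1.0, -2.0])
tiny_grad = np.array([1e-8, -2e-8])          # a "vanished" gradient
print(standard_update(w, tiny_grad, 0.1))    # barely moves: step ~1e-9
print(normalized_update(w, tiny_grad, 0.1))  # still takes a step of length 0.1
```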