So, what is vanishing/exploding gradient?

It’s more complex than that. Besides the architecture and the initialisation of the weights, other factors such as the hyperparameters and the choice of activation function also play a role in ensuring effective training with stable gradients, see also this thread: Activation functions - #2 by Christian_Simonis
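To make the activation-function point concrete, here is a small toy sketch (my own illustrative setup, not from the course material): during backprop, the gradient through a deep chain of layers is a product of per-layer factors, and with a sigmoid activation each factor's derivative is at most 0.25, so the product tends to shrink geometrically (vanishing gradient); factors consistently larger than 1 would blow it up instead (exploding gradient).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
depth = 50   # number of layers in the toy chain
grad = 1.0   # gradient signal arriving from the loss

for _ in range(depth):
    z = rng.normal()           # toy scalar pre-activation at this layer
    w = rng.normal()           # toy scalar weight at this layer
    s = sigmoid(z)
    # chain rule: multiply by the weight times the sigmoid derivative
    grad *= w * s * (1.0 - s)  # sigmoid'(z) = s * (1 - s) <= 0.25

print(f"gradient magnitude after {depth} layers: {abs(grad):.3e}")
```

Running this, the surviving gradient magnitude is tiny, which is exactly why early layers learn so slowly in deep sigmoid networks and why ReLU-style activations and careful initialisation help.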

Did you watch Andrew’s video?

It’s really good and illustrative, explaining what can happen to the gradient during training and backprop. I would recommend watching it - it’s really worth it!

Best regards