This is my first post in this community, so please be gentle if I break any rules.
This question is in the context of [Deep Learning Specialization] → [Course 2 - Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization] → [Section: Setting up your Optimization Problem] → [Video: Vanishing/Exploding Gradients].
I think I understand why small derivatives (caused by very deep neural networks) can lead to slower training, since we have to take more steps to reach the optimum during gradient descent. However, I don't understand why a large derivative is a problem. As long as the large step follows the "curve" of the loss function J, it would get us to the "bottom of the valley" sooner, no? This was not explained in the video.
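To make the setting of my question concrete, here is a minimal sketch (my own toy example, not from the course) of the effect the video describes: in a deep network where each layer effectively multiplies by a weight slightly above 1 (I assume a single scalar weight w = 1.5 per layer for simplicity), the output, and hence the gradient scale, grows exponentially with depth.

```python
import numpy as np

w = 1.5   # assumed per-layer scalar weight, identical across layers
x = 1.0   # scalar input

for L in (10, 50, 100):
    # Forward pass of a "linear" deep net: y = w^L * x,
    # so the gradients also scale roughly like w^L.
    y = (w ** L) * x
    print(f"depth {L:3d}: output / gradient scale ~ {y:.3e}")
```

Running this shows the scale blowing up from ~5.8e1 at depth 10 to ~4.1e17 at depth 100, which I take to be the "exploding" part. My question is why that large gradient is actually harmful for training, rather than just a faster way downhill.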