The professor said that "very deep neural networks are difficult to train because of vanishing and exploding gradient problems," and that this leads to the result shown in the graph on the left (the rising training error).
Q1. I think that an "exploding gradient" could lead to "rising training error," because an exploding gradient makes the gradient descent step too big.
Is my reasoning correct? Is there any other way an "exploding gradient" could lead to "rising training error"?
Q2. I do not think that a "vanishing gradient" could lead to "rising training error," because a vanishing gradient makes the gradient descent step too small, which leads to almost no change in the parameters (and therefore in the training error).
Is my reasoning correct? Is there any way a "vanishing gradient" could lead to "rising training error"?
Yes, exactly: the oversized update steps make the weights grow too large, and they may eventually become NaN during training; when that happens, the training error skyrockets.
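To see the mechanism concretely, here is a minimal toy sketch (my own illustration, not from the lecture): plain gradient descent on the quadratic loss L(w) = w², with a step size so large that every update overshoots the minimum. The weight grows by a constant factor each iteration, so the loss rises and eventually overflows, which is exactly the "skyrocketing training error" behaviour.

```python
import math

# Toy illustration of the exploding-update mechanism (assumed setup, not from
# the lecture): gradient descent on L(w) = w^2 with a far-too-large step size.
w = 1.0
lr = 100.0                     # each update multiplies w by (1 - 2*lr) = -199

for step in range(80):
    grad = 2.0 * w             # dL/dw
    w = w - lr * grad          # gradient-descent update overshoots the minimum
    loss = w * w
    if not math.isfinite(loss):
        print(f"step {step:2d}: loss overflowed to {loss} -- in a real network "
              "the weights soon turn into NaN")
        break
    if step % 10 == 0:
        print(f"step {step:2d}: loss = {loss:.3e}")
```

In a real deep network the oversized steps come from the product of many layer Jacobians rather than from a single bad learning rate, but the effect on the training error is the same.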
Yes. Keep in mind that the parameters of the higher layers may still change significantly, whereas the parameters of the lower layers barely change (or do not change at all). Either way, the model is still learning, just very slowly, so the training error will flatten out rather than increase, as long as you do not do anything strange with the learning rate.
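To make the "lower layers barely move" part concrete, here is a small NumPy sketch (my own toy example, with made-up layer sizes): one forward/backward pass through a deep stack of sigmoid layers, printing the gradient norm of each weight matrix. The norms shrink roughly geometrically toward the input.

```python
import numpy as np

# Toy illustration (assumed architecture, not from the lecture): a deep stack
# of fully connected sigmoid layers; one forward/backward pass on a squared
# error, then print the gradient norm of every layer's weight matrix.
rng = np.random.default_rng(0)
n_layers, width = 20, 64

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Ws = [rng.normal(0.0, 1.0 / np.sqrt(width), size=(width, width))
      for _ in range(n_layers)]

# Forward pass, keeping every activation for backprop.
activations = [rng.normal(size=(width, 1))]
for W in Ws:
    activations.append(sigmoid(W @ activations[-1]))

# Backward pass for L = 0.5 * ||a_top - y||^2 with target y = 0.
top = activations[-1]
delta = top * top * (1.0 - top)              # dL/dz at the top layer
grad_norms = [0.0] * n_layers
for l in reversed(range(n_layers)):
    grad_norms[l] = np.linalg.norm(delta @ activations[l].T)   # ||dL/dW_l||
    if l > 0:
        a = activations[l]                   # sigmoid output feeding layer l
        delta = (Ws[l].T @ delta) * a * (1.0 - a)

for l, g in enumerate(grad_norms):
    print(f"layer {l:2d}  ||dL/dW|| = {g:.3e}")
```

The top layers (large norms) still receive meaningful updates, which is why the error keeps inching down instead of rising.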