Hey there!
In the lecture "Why ResNets Work?", Professor Ng mentioned the vanishing and exploding gradient problems and explained how residual connections help alleviate both of them.
I understand how adding the earlier activation (a[l]) into the later layer, so that a[l+2] = g(z[l+2] + a[l]), helps with vanishing gradients: the gradient can flow back through the skip connection, so the earlier layer (a[l]) still receives a non-zero gradient and the model keeps learning.
What I am having a hard time understanding is how residual connections help with exploding gradients. If the activations are already very large, adding those two matrices will simply produce an output matrix with even larger values, won't it?
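To make my question concrete, here is a toy NumPy sketch of how I picture a residual block's forward pass. It is just my own illustration of the a[l+2] = g(z[l+2] + a[l]) formula from the lecture; the layer sizes, near-zero weights, and ReLU activation are assumptions for the example, not anything from the course code:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(a_l, W1, b1, W2, b2):
    # Main path: two linear layers with ReLU, as in the plain block from the lecture.
    z1 = W1 @ a_l + b1      # z[l+1]
    a1 = relu(z1)           # a[l+1]
    z2 = W2 @ a1 + b2       # z[l+2]
    # Skip connection: a[l] is added to z[l+2] before the activation,
    # i.e. a[l+2] = g(z[l+2] + a[l]).
    return relu(z2 + a_l)

# Toy check: with near-zero weights the main path contributes almost nothing,
# yet a[l] still passes through the block thanks to the skip connection.
rng = np.random.default_rng(0)
n = 4
a_l = relu(rng.standard_normal((n, 1)))            # some non-negative activation
W1 = 1e-3 * rng.standard_normal((n, n)); b1 = np.zeros((n, 1))
W2 = 1e-3 * rng.standard_normal((n, n)); b2 = np.zeros((n, 1))
print(residual_block(a_l, W1, b1, W2, b2))         # roughly equal to a_l
```

This shows the vanishing-gradient side I described above, but it also shows where I get stuck: if a[l] and z[l+2] were both huge instead of small, the sum would be huge as well.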
I would appreciate any help.
Cheers!