In the lecture "Why Residual Nets Work?", Professor Ng mentioned the vanishing and exploding gradient problems and explained how residual connections help alleviate both of them.
I understand how the skip connection helps with vanishing gradients: since a[l+2] = g(z[l+2] + a[l]), the earlier activation a[l] is added back in directly, so even if the gradients flowing through the intermediate layers shrink toward zero, the gradient through the a[l] path is still non-zero and helps the model learn.
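To make sure I have the forward pass right, here is a minimal NumPy sketch of the skip connection as I understand it (the names `residual_block`, `W1`, `W2`, etc. are my own, not from the lecture, and I'm assuming ReLU activations):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(a_l, W1, b1, W2, b2):
    # Two ordinary layers: a[l] -> z[l+1] -> a[l+1] -> z[l+2]
    z1 = W1 @ a_l + b1
    a1 = relu(z1)
    z2 = W2 @ a1 + b2
    # Skip connection: add a[l] before the final activation,
    # i.e. a[l+2] = g(z[l+2] + a[l])
    return relu(z2 + a_l)

rng = np.random.default_rng(0)
n = 4
a_l = rng.standard_normal(n)
W1 = rng.standard_normal((n, n))
W2 = rng.standard_normal((n, n))
b1 = np.zeros(n)
b2 = np.zeros(n)

out = residual_block(a_l, W1, b1, W2, b2)
print(out.shape)  # (4,)
```

If the intermediate weights collapse to zero, the block reduces to relu(a[l]), which is the "easy to learn the identity" behaviour the lecture describes.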
However, I am having a hard time understanding how residual connections help with exploding gradients. Wouldn't adding a[l] to z[l+2] simply produce an output with even larger values?
I would appreciate any help.