Hey there!

In the lecture **Why ResNets Work?**, Professor Ng mentioned the vanishing and exploding gradient problems and explained how residual connections help alleviate both.

I understand how adding the activation from an earlier layer (a[l]) to a later one (a[l+2]) helps with vanishing gradients: the skip connection gives the gradient an identity path, so even if the gradients along the main path shrink toward zero, the contribution from a[l] stays non-zero and the model can still learn.
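To make my mental model concrete, here is a toy numeric sketch I put together myself (not from the lecture): each "layer" is just multiplication by a small scalar weight `w`, so the gradient through a plain chain of L layers is w^L, while a residual block contributes w^2 + 1 because of the identity path.

```python
# Toy illustration (my own, not from the course): compare the gradient
# of a deep plain chain of layers with a chain of residual blocks.

def plain_gradient(w, num_layers):
    # d(output)/d(input) for a chain of linear layers a -> w * a
    g = 1.0
    for _ in range(num_layers):
        g *= w
    return g

def residual_gradient(w, num_blocks):
    # Each block computes a_out = w * w * a_in + a_in,
    # so its local gradient is w*w + 1 (the +1 is the skip path).
    g = 1.0
    for _ in range(num_blocks):
        g *= (w * w + 1.0)
    return g

print(plain_gradient(0.5, 20))     # ~9.5e-7: vanishes with depth
print(residual_gradient(0.5, 10))  # ~9.3: stays healthy
```

With small weights the plain chain's gradient collapses toward zero, while the residual chain's gradient never drops below 1, which matches my intuition for the vanishing case.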

What I am having a hard time understanding is how residual connections help with *exploding* gradients. Wouldn't adding those two matrices simply produce an output with even larger values?

I would appreciate any help.

Cheers!