I loved the video in Course 4, Week 2 called “Why ResNets Work?”. It gave me some intuition about why adding the identity via a skip connection protects against vanishing gradients.
Maybe the answer to this is another video, but what about exploding gradients? It seems like to protect against those we’d want to add, I don’t know, some kind of log() function on the Wx+b term, in addition to the identity from the skip connection, e.g. g( (Wx+b) + (a_-1) + log(Wx+b) ).
In other words, without something like this, do skip connections help protect ResNets against exploding gradients as well as vanishing ones?
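
To make the notation concrete, here is a rough sketch in plain Python of the two forms I'm comparing: the usual skip connection, g( (Wx+b) + a_-1 ), and the variant with the extra log() term. The helper names (`affine`, `residual`, `residual_with_log`) and the toy numbers are made up for illustration, and I'm assuming ReLU for g. Note that log(Wx+b) only exists where Wx+b > 0, so this is just a sketch of the idea, not something I'd claim works as written.

```python
import math

def relu(v):
    """Elementwise ReLU, standing in for the activation g (an assumption)."""
    return [max(0.0, x) for x in v]

def affine(W, x, b):
    """Compute z = Wx + b, with W given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def residual(W, x, b, a_skip):
    """Standard skip connection: g(z + a_skip)."""
    z = affine(W, x, b)
    return relu([z_i + s_i for z_i, s_i in zip(z, a_skip)])

def residual_with_log(W, x, b, a_skip):
    """Variant from the question: g(z + a_skip + log(z)).

    Assumption: every entry of z is positive. math.log raises
    ValueError otherwise, and the question does not say how to handle that.
    """
    z = affine(W, x, b)
    return relu([z_i + s_i + math.log(z_i) for z_i, s_i in zip(z, a_skip)])

# Toy inputs (hypothetical): identity weights so that z == x.
W = [[1.0, 0.0], [0.0, 1.0]]
x = [1.0, 2.0]
b = [0.0, 0.0]
a_skip = [0.5, 0.5]  # plays the role of a_-1

standard = residual(W, x, b, a_skip)
proposed = residual_with_log(W, x, b, a_skip)
print(standard, proposed)
```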
Thank you, Andrew & Mentors! This course is great!
Ken
(BTW, I searched to see if this question had already been asked and couldn’t find it. Apologies if I just missed it somehow.)