Posting this to make sure I'm understanding this correctly.
-
In a very deep neural network, as we keep training it, the weight values may become very small (maybe due to L2 regularization / weight decay, etc.).
-
Assuming the bias term is 0, in a plain network,
a[l+2] = g( w[l+2]*a[l+1] )
Because of the point above, w[l+2] may be so small that we can almost ignore the term "w[l+2]*a[l+1]".
So a[l+2] = g(0) ≈ 0 (maybe not exactly 0, but a really, really small value), which means the activation vanishes, and this may hurt the network's performance.
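A quick numerical check of this point (just my own toy example with made-up layer sizes, using NumPy and ReLU as g):

```python
import numpy as np

# Toy "plain" layer: ReLU activation, near-zero weights, bias = 0.
rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0, x)

a_l1 = relu(rng.standard_normal(4))        # some activation a[l+1]
w_l2 = 1e-6 * rng.standard_normal((4, 4))  # weights shrunk toward 0 (e.g. by weight decay)

a_l2 = relu(w_l2 @ a_l1)                   # a[l+2] = g( w[l+2]*a[l+1] )
print(a_l2)                                # all entries ~0: the activation has essentially vanished
```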
-
However, if we use a residual block, what happens is
a[l+2] = g( w[l+2]*a[l+1] + a[l] )
ending up with a[l+2] ≈ g(a[l]) = a[l] (since g is ReLU and a[l] is already non-negative), i.e. an identity function.
So <even if something goes wrong and the weights are almost 0, the residual block will turn into an identity function> and keep the activations alive.
That phrase in <> is what "it is easy for a residual block to learn the identity function" refers to, from what I understood. And that is why residual blocks help improve performance.
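And the same toy example with the skip connection added (again just a sketch I wrote to check the math, not anything official):

```python
import numpy as np

# Toy residual block: two ReLU layers with near-zero weights, plus the skip connection a[l].
rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0, x)

a_l = relu(rng.standard_normal(4))          # a[l], the input to the residual block (non-negative)
w_l1 = 1e-6 * rng.standard_normal((4, 4))   # both weight matrices nearly 0
w_l2 = 1e-6 * rng.standard_normal((4, 4))

a_l1 = relu(w_l1 @ a_l)                     # a[l+1] = g( w[l+1]*a[l] )
a_l2 = relu(w_l2 @ a_l1 + a_l)              # a[l+2] = g( w[l+2]*a[l+1] + a[l] )
print(np.allclose(a_l2, a_l, atol=1e-5))    # True: the block behaves like an identity function
```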
Am I understanding right?