C4W2: About what "Residual block is easy to learn identity function" means

Posting this to check whether I'm understanding this right.

  1. In a very deep neural network, as we keep training it, the weight values may become very small (e.g., due to L2 regularization, weight decay, etc.).

  2. Assuming the bias term is 0, in a plain network,
    a[l+2] = g( w[l+2]*a[l+1] )

Due to 1), w[l+2] may be so small that we can almost ignore the term “w[l+2]*a[l+1]”.
So a[l+2] = g(0) = 0 (maybe not exactly 0, but a really, really small value), which means the activation vanishes, and this may hurt the network’s performance.
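Step 2 can be sketched numerically. This is a minimal numpy toy (made-up layer size, and a 1e-8 scale factor standing in for weights shrunk by decay), not anything from the course code:

```python
import numpy as np

def relu(x):
    # g = ReLU, the activation used in the lectures
    return np.maximum(0, x)

rng = np.random.default_rng(0)
a_l1 = rng.standard_normal(4)               # a[l+1], a hypothetical activation vector
w_l2 = 1e-8 * rng.standard_normal((4, 4))   # w[l+2], shrunk toward 0 by weight decay

# Plain block with bias = 0: a[l+2] = g( w[l+2]*a[l+1] )
a_l2 = relu(w_l2 @ a_l1)
print(a_l2)  # every entry is effectively zero: the activation has vanished
```

With w[l+2] this small, the pre-activation is on the order of 1e-7, so g(0) ≈ 0 and the signal dies out exactly as described above.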

  3. However, if we use a residual block, what happens is
    a[l+2] = g( w[l+2]*a[l+1] + a[l] )
    Since w[l+2]*a[l+1] ≈ 0 and g is ReLU (so g(a[l]) = a[l], because a[l] is itself a ReLU output and hence non-negative), this ends up as a[l+2] = a[l], an identity function.
    So <even if something goes wrong and the weights are almost 0, the residual block falls back to an identity function> and keeps the activation alive.
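The same toy setup shows the skip connection rescuing the activation. Again a minimal numpy sketch with invented dimensions, assuming g = ReLU and bias = 0:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

rng = np.random.default_rng(0)
a_l = relu(rng.standard_normal(4))          # a[l]: non-negative, as a ReLU output
w_l2 = 1e-8 * rng.standard_normal((4, 4))   # near-zero weights, as in step 1

# Residual block: a[l+2] = g( w[l+2]*a[l+1] + a[l] ), taking a[l+1] = a[l]
a_l2 = relu(w_l2 @ a_l + a_l)
print(np.allclose(a_l2, a_l, atol=1e-5))    # the block has collapsed to identity
```

The weighted term contributes almost nothing, so the output is (numerically) just a[l] passed through: the "easy to learn identity" behavior in question.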

That phrase in <> is what “a residual block easily learns the identity function” refers to, as I understand it. And that is why residual blocks help improve performance.

Am I understanding right?

Hello @allegro6335,

It seems to me you are trying to understand ResNet from the angle of the problem it was introduced to solve. I strongly recommend reading even just the fairly short Introduction section of the ResNet paper, and I am sure you will find that angle explained by the authors themselves. Let me know your thoughts or, if any, your most recent understanding, and we can discuss them. If, after reading the Introduction, you still want to talk about the understanding in your first post, let me know that too.
