In the lecture, they show that under a set of assumptions where w[l+2] and the bias b[l+2] are set to 0, a[l+2] = a[l] can occur.
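For reference, here is my reconstruction of the lecture's step (I'm assuming g is ReLU and a[l] >= 0, so g(a[l]) = a[l]):

$$
\begin{aligned}
a^{[l+2]} &= g\big(z^{[l+2]} + a^{[l]}\big) \\
          &= g\big(W^{[l+2]}\,a^{[l+1]} + b^{[l+2]} + a^{[l]}\big) \\
          &= g\big(a^{[l]}\big) && \text{if } W^{[l+2]} = 0,\ b^{[l+2]} = 0 \\
          &= a^{[l]} && \text{since } g = \mathrm{ReLU} \text{ and } a^{[l]} \ge 0
\end{aligned}
$$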
However, I am confused about why this implies an identity function and why it means ResNet is easier to train.
Here’s how I understand it: if training fails between a[l] and a[l+1], then a[l+2] can fall back on a[l] instead of relying on the faulty a[l+1]. In that case, the network can simply ignore the weights applied to a[l+1] (like a backup system).
Is my understanding correct?
And is there an underlying assumption that the identity function is difficult for a plain network to learn through backprop?
Hi @sunblockisneeded
If learning fails for the transformation between a[l] and a[l+1], the residual connection allows a[l+2] to simply copy a[l] via the identity mapping. This bypasses the faulty transformation and lets the network fall back on the earlier representation. It also makes training easier because the skip connection gives gradients a direct path back to a[l] during backpropagation.
And just like you said, there is an underlying assumption: a stack of plain layers finds it surprisingly hard to learn the identity mapping from scratch through backprop. ResNet simplifies this by providing the identity mapping explicitly through the skip connection, so the layers only have to learn the residual on top of it and can push their weights toward zero whenever the extra transformation isn't needed (see the sketch below).
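Here is a minimal PyTorch sketch to make it concrete (my own illustration, not code from the course): a fully-connected residual block where W[l+2] and b[l+2] are zeroed out, as in the lecture's assumption. The block then collapses to the identity, and the gradient still reaches a[l] through the skip connection.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """a[l+2] = relu( W[l+2] @ relu(W[l+1] @ a[l] + b[l+1]) + b[l+2] + a[l] )"""
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)  # produces z[l+1]
        self.fc2 = nn.Linear(dim, dim)  # produces z[l+2]

    def forward(self, a_l):
        a_l1 = torch.relu(self.fc1(a_l))   # a[l+1]
        z_l2 = self.fc2(a_l1) + a_l        # skip connection adds a[l]
        return torch.relu(z_l2)            # a[l+2]

block = ResidualBlock(4)
nn.init.zeros_(block.fc2.weight)  # the lecture's assumption: W[l+2] = 0
nn.init.zeros_(block.fc2.bias)    # and b[l+2] = 0

a_l = torch.rand(1, 4, requires_grad=True)  # a[l] >= 0, like a ReLU output
a_l2 = block(a_l)

print(torch.allclose(a_l2, a_l))  # True: the block collapses to the identity
a_l2.sum().backward()
print(a_l.grad)                   # all ones: the gradient flows straight through the skip
```

If you drop the `+ a_l` to get a plain two-layer block, the same zero weights would give an output of zero and a zero gradient with respect to a[l], which is exactly the situation the skip connection protects against.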
Hope it helps! Feel free to ask if you need further assistance.
Thanks for your answer. Your explanation helped me build a more robust neural network in my brain.
You’re welcome! Happy to help.