Skip connections in ResNets


It is said, also in this course, that skip connections enable the network to learn the identity function (which is linear) more easily. On the other hand, on the slides each skip connection is fed, after skipping one, two, or three blocks, into the following nonlinearity (ReLU etc.) together with the following activation z. This places a nonlinearity on every "skip" path through the network as well.

The ability to learn the identity function (or a linear function) is therefore restricted to a very short partial path through the network (very local). At least as I see it, this reduces the network's ability to learn the (linear) identity function, because every (global) "skip" signal path is also distorted by one or many nonlinearities, which also makes it more difficult for the gradient to backpropagate along these paths.

Why is there no global path from input to output that contains no nonlinearity at all, when the goal is for the network to better learn the identity function (or a linear function)?
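To make the concern concrete, here is a minimal sketch (function names and shapes are my own, using numpy) of a residual block in the common "post-activation" form, where ReLU is applied after the addition. It shows that even when the residual branch outputs exactly zero, the block is not the identity for negative inputs:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W):
    """Post-activation residual block: out = ReLU(x + F(x)),
    where F is the learned residual branch (here one linear layer + ReLU)."""
    f = relu(x @ W)     # residual branch
    return relu(x + f)  # the skip addition is followed by a nonlinearity

# Even if the residual branch learns to output zero (W = 0),
# the final ReLU still clips negative components of the skip path.
x = np.array([-1.0, 2.0, -3.0])
W = np.zeros((3, 3))
print(residual_block(x, W))  # -> [0. 2. 0.], not the identity [-1. 2. -3.]
```

So, strictly speaking, a block like this can only represent the identity on the region where its output stays nonnegative, which is the "local linearity" limitation the question describes.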

Thank you very much in advance – Uwe :slight_smile:

Hello @Uwe_Z,

For example, in the U-Net architecture, we have longer skip connections:

You will get a chance to work with the U-Net model in Course 4, week 3.

The goal of the network is to learn a mapping from input to output. The optimal mapping is very rarely the identity; that would only be the case if the outputs equaled the inputs. Skip connections mainly help gradients flow better through a deep architecture with many consecutive layers, which improves learning and our chances of finding a good mapping from input to output.
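The gradient-flow benefit can be seen in a toy scalar example (the gain `a` and depth `n` are assumed values for illustration, not from the course). Each "layer" is `g(x) = a * relu(x)`; for positive inputs its derivative is `a`. A plain chain of such layers shrinks the gradient geometrically, while a residual chain `y_k = y_{k-1} + g(y_{k-1})` keeps a direct `1 + a` factor per block:

```python
a, n = 0.1, 10  # small per-layer gain, 10 stacked layers (assumed values)

# Plain chain: y = g(g(...g(x))); for x > 0 the gradient is a**n.
plain_grad = a ** n          # 1e-10: effectively vanishes

# Residual chain: y_k = y_{k-1} + g(y_{k-1});
# for positive activations the gradient is (1 + a)**n.
residual_grad = (1 + a) ** n  # ~2.59: stays at a healthy scale

print(plain_grad, residual_grad)
```

The "1" in each `1 + a` factor comes from the skip path, which is why gradients can reach early layers even when the residual branches themselves are weak.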


Thank you very much for your answer! Please excuse my late reply. I think the U-Net you showed answers my question. There are paths with increasing numbers of nonlinearities through the network: the top path contains only very few nonlinearities, while the paths going through the layers below include more. Thank you again for your help :slight_smile:

1 Like