Hello,
it is said, also in this course, that skip connections enable the network to learn the identity function (which is linear) more easily. On the other hand, on the slides each skip connection, after skipping one, two or three blocks, is fed into the following nonlinearity (ReLU etc.) together with the following activation z. This places a nonlinearity on every "skip" path through the network as well.

The ability to learn the identity function (or a linear function) is therefore restricted to a very short partial path through the network (very local). At least as I see it, this reduces the ability of the network to learn the (linear) identity function, because every (global) "skip" signal path is also distorted by a nonlinearity / by many nonlinearities, which in turn makes it harder for the gradient to backpropagate along these paths as well. Why is there no global path from input to output that contains no nonlinearity at all, if the goal is for the network to learn the identity function (or a linear function) more easily?
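To make concrete what I mean, here is a minimal sketch of such a block as I understand it from the slides (assuming the common post-activation residual block; the class name ResidualBlock and the 3x3 conv layers are just illustrative, not the exact architecture from the course):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Post-activation residual block: the skip path is added to the block
    output and the sum is then passed through ReLU, so even the identity/skip
    signal goes through a nonlinearity in every block."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        z = self.conv2(F.relu(self.conv1(x)))
        return F.relu(z + x)  # the final ReLU also distorts the skip path x

# e.g.: y = ResidualBlock(64)(torch.randn(1, 64, 8, 8))
```

So across many stacked blocks there is no path from input to output that avoids every ReLU, which is exactly what my question is about.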
Thank you very much in advance – Uwe