The following question marks this option as incorrect:
“The skip connections compute a complex non-linear function of the input to pass to a deeper layer in the network”
The explanation given is:
“This is false, skip connections help the model to learn an identity mapping, not a complex non-linear function”
I believe it does both. Isn’t g(z[l] + a[l-k]) more complex, or more non-linear, than g(z[l]) alone, depending on the choice of g?
Question:
“Which ones of the following statements on Residual Networks are true? (Check all that apply.)”
From the original ResNet paper, Deep Residual Learning for Image Recognition:
This reformulation is motivated by the counterintuitive phenomena about the degradation problem (Fig. 1, left). As we discussed in the introduction, if the added layers can be constructed as identity mappings, a deeper model should have training error no greater than its shallower counterpart. The degradation problem suggests that the solvers might have difficulties in approximating identity mappings by multiple nonlinear layers. With the residual learning reformulation, if identity mappings are optimal, the solvers may simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings.
In real cases, it is unlikely that identity mappings are optimal, but our reformulation may help to precondition the problem. If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping, than to learn the function as a new one. We show by experiments (Fig. 7) that the learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning.
Hence,
Hmm, looks like it’s only true if g is ReLU… I’d be interested to know whether you’d get a similar result when g is some other non-linear function.
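One way to check this is a small sketch in plain Python. It is not code from the course or the paper. It assumes a two-layer residual block of the form a_out = g(z2 + a_prev), and the helper names (`linear`, `residual_block`) and the toy size n = 3 are made up for illustration. With all weights and biases at zero, the block should return its input unchanged when g is ReLU and the input is non-negative, but not when g is tanh:

```python
import math

def relu(x):
    return [max(0.0, v) for v in x]

def tanh(x):
    return [math.tanh(v) for v in x]

def linear(W, b, a):
    # z = W a + b, with W given as a list of rows (hypothetical helper)
    return [sum(w * v for w, v in zip(row, a)) + bi for row, bi in zip(W, b)]

def residual_block(a_prev, W1, b1, W2, b2, g):
    # Assumed block shape: a_out = g(z2 + a_prev), skip over two layers
    a1 = g(linear(W1, b1, a_prev))
    z2 = linear(W2, b2, a1)
    return g([z + s for z, s in zip(z2, a_prev)])

n = 3
zeros_W = [[0.0] * n for _ in range(n)]
zeros_b = [0.0] * n

# a_prev stands in for the output of an earlier ReLU layer, so it is non-negative
a_prev = relu([0.5, -1.2, 2.0])

out_relu = residual_block(a_prev, zeros_W, zeros_b, zeros_W, zeros_b, relu)
out_tanh = residual_block(a_prev, zeros_W, zeros_b, zeros_W, zeros_b, tanh)

print(out_relu == a_prev)  # ReLU leaves non-negative inputs unchanged
print(out_tanh == a_prev)  # tanh squashes the skip path, so the input is not recovered
```

In this sketch the zero-weight block is an exact identity only because ReLU passes non-negative values through untouched. With tanh, the skip path is still squashed by the final activation. With non-zero weights, the same block computes a non-linear function of its input in either case.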