Why do ResNets work?

In calculating a[l+2]:

a[l+2] = g(z[l+2] + a[l])

You still have to go through the normal computational flow of z[l+2], in which nothing is skipped. a[l] doesn't replace z[l+2]; it's just an add-on. So why do residual blocks help with training much deeper networks?
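
To be concrete about what I mean, here is a minimal sketch of the forward pass I have in mind (NumPy, with hypothetical weight names W1, b1, W2, b2 for layers l+1 and l+2):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(a_l, W1, b1, W2, b2):
    """Two-layer residual block: maps a[l] to a[l+2] (weight names are illustrative)."""
    z1 = W1 @ a_l + b1        # z[l+1] -- computed as usual
    a1 = relu(z1)             # a[l+1]
    z2 = W2 @ a1 + b2         # z[l+2] -- nothing here is skipped
    return relu(z2 + a_l)     # a[l+2] = g(z[l+2] + a[l]); a[l] is added on, not substituted
```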

Right, so both a[l] and z[l+2] are used in producing a[l+2]. If going through layers l+1 and l+2 diminishes the weights, you still have a[l] from the residual connection, and with it the information learned up to layer l. So a[l+2] will carry information very similar to a[l], because the contribution of layers l+1 and l+2 is minimal due to the diminished weights.
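
Here is a quick numerical sketch of that point (purely illustrative values, assuming ReLU activations and tiny weights in layers l+1 and l+2):

```python
import numpy as np

np.random.seed(0)
n = 4
a_l = np.random.rand(n)                      # non-negative, as a post-ReLU activation would be

# layers l+1 and l+2 with "diminished" weights (hypothetical values)
W1, b1 = np.random.randn(n, n), np.zeros(n)
W2, b2 = 1e-6 * np.random.randn(n, n), np.zeros(n)

a1 = np.maximum(0, W1 @ a_l + b1)            # a[l+1]
z2 = W2 @ a1 + b2                            # z[l+2] is nearly zero
a_l2 = np.maximum(0, z2 + a_l)               # a[l+2] is approximately a[l]

print(np.allclose(a_l2, a_l, atol=1e-4))     # True -- the block behaves like the identity
```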

That makes sense. @gent.spah So the residual connection works as a backup connection, in case the vanishing gradient problem occurs.

Yes, that’s a good way to say it. You have two paths in parallel contributing to that result and the backpropagation will happen over both routes. Note that Prof Ng did discuss all this in some detail in the lectures. If you missed this explanation, it might be worth watching them again.
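
To make the "two routes" point concrete, here is a toy 1-D sketch (all values are made up) of how the gradient of a[l+2] with respect to a[l] splits between the main path and the shortcut:

```python
# Scalar version of the block: a2 = relu(w2 * relu(w1 * a + b1) + b2 + a)
a, w1, b1, w2, b2 = 0.8, 0.5, 0.0, 1e-6, 0.0

z1 = w1 * a + b1
a1 = max(0.0, z1)
z2 = w2 * a1 + b2
a2 = max(0.0, z2 + a)

# d a2 / d a is the sum of two routes:
g_prime = 1.0 if (z2 + a) > 0 else 0.0
main_path = g_prime * w2 * (1.0 if z1 > 0 else 0.0) * w1   # through layers l+1 and l+2
skip_path = g_prime * 1.0                                  # through the shortcut
print(main_path, skip_path, main_path + skip_path)
# main path is ~5e-7 (nearly vanished), but the skip path still passes a gradient of 1
```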