Why do ResNets work? Does weight decay cause the activations to be the same?


In the video on ResNets, Andrew mentioned that, due to weight decay, the weights and bias can become zero.
Since we are using ReLU, i.e. a >= 0:

a[l+2] = g(W[l+2] * a[l+1] + b[l+2] + a[l])
With W[l+2] = 0 and b[l+2] = 0: a[l+2] = g(a[l]) = a[l]

So in this case, can we discard the two layers, since both activations are the same, or are they still required?
Can someone help me?


You cannot discard them, because they are built into the model and in other scenarios their weights might not be “0”. If the weights come close to 0, then those layers contribute nothing; that is their effect.
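To make that concrete, here is a minimal sketch of a two-layer residual block in plain Python. The helper names (`relu`, `affine`, `residual_block`) and all the numbers are made up for illustration; this is not the course's code. It only shows that with W[l+2] = 0 and b[l+2] = 0 the block passes a[l] through unchanged, while the same block with nonzero weights adds something on top.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(x, 0.0) for x in v]

def affine(W, a, b):
    """Compute W @ a + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, a)) + bi for row, bi in zip(W, b)]

def residual_block(a_l, W1, b1, W2, b2):
    """a[l+2] = g(W[l+2] a[l+1] + b[l+2] + a[l]) with g = ReLU."""
    a_l1 = relu(affine(W1, a_l, b1))                        # a[l+1]
    z_l2 = affine(W2, a_l1, b2)                             # main path, pre-activation
    return relu([z + skip for z, skip in zip(z_l2, a_l)])   # add skip, then ReLU

# Hypothetical input: a[l] comes out of an earlier ReLU, so it is >= 0.
a_l = [1.0, 0.0, 2.0]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zeros_mat = [[0.0] * 3 for _ in range(3)]
zeros_vec = [0.0] * 3

# Case 1: W[l+2] and b[l+2] shrunk all the way to zero.
out_zero = residual_block(a_l, identity, zeros_vec, zeros_mat, zeros_vec)

# Case 2: same block, but W[l+2] is nonzero (every entry 0.5).
half_mat = [[0.5] * 3 for _ in range(3)]
out_nonzero = residual_block(a_l, identity, zeros_vec, half_mat, zeros_vec)

print(out_zero)     # equals a_l: the block acts as the identity
print(out_nonzero)  # differs from a_l: the two layers contribute
```

The layers stay in the model because training decides which case you end up in; the zero-weight case is just the easy fallback.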

It might be worth watching the lectures again. I don’t think Prof Ng uses the term “weight decay” anywhere there. What he does talk about is vanishing and exploding gradients, which make it difficult to train very deep networks, that is, networks with lots of layers. The innovation of Residual Networks is that they use parallel “skip” connections as an alternate pathway around some of the layers of the network, and it turns out that this has a moderating effect on training and helps keep the gradients from vanishing or exploding.

As Gent says, you need to keep both connections, because both are part of the network and they work together. As you’ll also hear Prof Ng say in the lectures, the goal is not to learn the “identity mapping” on the skip connection, since that wouldn’t be a very interesting solution. The point is that having that alternative pathway participate in the training helps keep things “on the rails”, meaning that training is more likely to converge and give a useful solution.
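A toy way to see the effect on gradients, under heavy simplifying assumptions: treat each layer as a single scalar weight `w`, assume all activations stay positive so the ReLU derivative is 1, and compare a plain chain a[k+1] = w * a[k] with a residual chain a[k+1] = a[k] + w * a[k]. The values `w = 0.01` and `depth = 50` are arbitrary choices for illustration, and real networks are of course not scalar chains.

```python
def end_to_end_gradient(local_derivative, depth):
    """Multiply the per-layer derivatives d a[k+1] / d a[k] along the chain."""
    grad = 1.0
    for _ in range(depth):
        grad *= local_derivative
    return grad

w = 0.01      # small weight, e.g. after strong regularization (assumed value)
depth = 50    # number of stacked layers (assumed value)

# Plain chain: each layer scales the gradient by w.
plain_grad = end_to_end_gradient(w, depth)

# Residual chain: the skip connection adds 1 to each local derivative.
residual_grad = end_to_end_gradient(1.0 + w, depth)

print(f"plain:    {plain_grad:.3e}")
print(f"residual: {residual_grad:.3f}")
```

In the plain chain the gradient is multiplied by a tiny factor at every layer and effectively vanishes, while in the residual chain the identity term from the skip path keeps it at a usable scale. This sketch only covers the vanishing side of the problem; with large weights the residual product can still grow.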