Skip connections in ResNet

I get that it reduces the problem of vanishing gradients, but if the layers in the network are built with shortcut connections intended to skip over them, then why create that 'skipped layer' in the first place? Won't that defeat the purpose of 'creating a layer'?


Hi, @Arisha_Prasain!

I think you are seeing it as an analogy to an electrical circuit, but it is quite different here. Having this kind of connection gives the network both the input and the output of the layer, not just the input as I think you are suggesting. You can always check the original paper for further information.


The skip connection represents a second (shortcut) path through the residual block, but the main path, which does not go through the skip connection, is still computed.

The last layer in the residual block passes the sum of the shortcut path (A[k]) and the main path (Z[k+2]) to the activation function (g). The shortcut path (A[k]) is the input to the residual block.

A[k+2] = g( Z[k+2] + A[k] )
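
For concreteness, here is a minimal NumPy sketch of that formula for a fully connected block. The weight names (W1, b1, W2, b2) are illustrative, and it assumes W2 maps back to the dimension of A[k] so the two terms can be added:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(a_k, W1, b1, W2, b2):
    # Main path: two linear layers with a ReLU in between.
    z_k1 = W1 @ a_k + b1
    a_k1 = relu(z_k1)
    z_k2 = W2 @ a_k1 + b2          # this is Z[k+2]
    # Shortcut path: add the block's input A[k] before the final activation.
    return relu(z_k2 + a_k)        # A[k+2] = g(Z[k+2] + A[k])
```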

You can imagine 2 scenarios.

  • Scenario 1: Due to vanishing gradients, the main path (Z[k+2]) is 0. As a result, the value passed to the activation function is g( 0 + A[k] ), which is identical to the input, since A[k] is itself a ReLU output and applying g again leaves it unchanged (see the snippet after this list). In this scenario, the residual block acts as the identity function, returning as output whatever was passed as input.
  • Scenario 2: The main path (Z[k+2]) is nonzero. As a result, the value passed to the activation function is g( Z[k+2] + A[k] ). In this scenario, both the main path and the skip connection contribute to the output of the residual block.
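
A quick numeric illustration of both scenarios (the values are made up; it assumes A[k] is non-negative, as a ReLU output would be):

```python
import numpy as np

a_k = np.array([1.0, 2.0, 3.0])        # block input (already non-negative)

# Scenario 1: the main path has vanished, Z[k+2] == 0.
z_zero = np.zeros_like(a_k)
print(np.maximum(0, z_zero + a_k))     # [1. 2. 3.]  -> identity

# Scenario 2: a nonzero main path adds a learned correction.
z = np.array([0.5, -0.5, 1.0])
print(np.maximum(0, z + a_k))          # [1.5 1.5 4.]
```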

Figure 2 of the ResNet paper describes the residual block showing both the main path F(x) and shortcut path x. Both paths meet at the sum junction before being passed to the ReLU activation function.
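
As a sketch of how that figure translates into code, here is one way to express an identity residual block in Keras. It assumes the input already has `filters` channels so the shortcut can be added without a projection; the filter count and kernel size are illustrative:

```python
from tensorflow.keras import layers

def residual_block(x, filters=64):
    shortcut = x                                     # shortcut path: x
    f = layers.Conv2D(filters, 3, padding="same")(x) # main path F(x)...
    f = layers.Activation("relu")(f)
    f = layers.Conv2D(filters, 3, padding="same")(f)
    out = layers.Add()([f, shortcut])                # F(x) + x at the sum junction
    return layers.Activation("relu")(out)            # ReLU after the addition
```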


@Marco_Morais Since layer k can have multiple activations, how is A[k] added to Z[k+2]? Are all the activations in A[k] just added up? A[k] is a vector, not a single number.

They are two vectors (or matrices, in the case of multiple samples) of the same dimensions, so you can add them together elementwise, resulting in yet another vector (or matrix) of the same dimensions.
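
For example, a toy elementwise sum in NumPy, with each column standing in for one sample (the values are made up):

```python
import numpy as np

# A[k] and Z[k+2] with the same shape: (units, samples)
a_k  = np.array([[1.0, 0.5],
                 [2.0, 1.5]])
z_k2 = np.array([[0.1, -0.5],
                 [0.3,  0.0]])

print(z_k2 + a_k)   # elementwise sum, same shape as both inputs
# [[1.1 0. ]
#  [2.3 1.5]]
```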