I understand that the purpose of the residual block is to mitigate the vanishing gradient problem, with the output of the residual block given by:
a[l+2] = g ( W[l+2] . a[l+1] + b[l+2] + a[l] )
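For concreteness, here is a minimal numpy sketch of how I picture that forward pass (the ReLU choice, the variable names, and the assumption that a[l] has the same dimension as a[l+2] are just my own choices for illustration):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(a_l, W1, b1, W2, b2):
    """Two-layer residual block:
    a[l+1] = g(W[l+1] . a[l] + b[l+1])
    a[l+2] = g(W[l+2] . a[l+1] + b[l+2] + a[l])
    """
    a_l1 = relu(W1 @ a_l + b1)
    z_l2 = W2 @ a_l1 + b2
    return relu(z_l2 + a_l)  # skip connection adds a[l] before the final activation
```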
Firstly, in the case that the gradient goes to zero, doesn't this only affect dW and db? So rather than W, b → 0, wouldn't they instead just remain constant? Surely for W, b → 0 to happen, it would have to have been learned, and would therefore suggest optimality?
Secondly, why do we need the skipped connection? If instead we had:
a[l+1] = g ( W[l+1] . a[l] + b[l+1] + a[l] )
then in the case that W, b → 0 we get:
a[l+1] = g ( a[l] ) = a[l]
and we only have one redundant layer rather than two. I appreciate that we could write the above as:
a[l+1] = g ( W’[l+1] . a[l] + b[l+1] )
where W’ = W + I (I being the identity matrix), which reduces to the standard forward propagation step. But explicitly adding the a[l] term dissociates it from any learned parameters, so that it can’t vanish in the same way?
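(A quick numerical check of that equivalence, purely illustrative, with random values and ReLU assumed:)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
W = rng.standard_normal((n, n))
b = rng.standard_normal(n)
a = rng.standard_normal(n)

relu = lambda z: np.maximum(0, z)

# g(W . a + b + a) is the same as g((W + I) . a + b)
print(np.allclose(relu(W @ a + b + a), relu((W + np.eye(n)) @ a + b)))  # True
```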
Hi @TheSuperLemming
Here W, b wouldn’t actually go to zero; they shrink depending on the regularization parameter lambda and on what gradient descent finds to give the best g ( W[l+1] . a[l] + b[l+1] + a[l] ), which is the input to the next layer. So even if we suppose W, b will be small numbers, they still have an effect on the output of this layer, and gradient descent tunes how much effect W, b have so as to reach the optimization target with a small error without falling into overfitting. So we can’t neglect them and rely on W, b being zero.
please feel free to ask any questions,
Thanks,
Abdelrahman
Hi @AbdElRhaman_Fakhry
Thanks for the response, though I’m still not sure I understand.
Do you mean that adding the a[l] term inside the activation for a[l+2] prevents over-dependence on W and b?
And I’m still not sure why we can’t feed a[l] into the activation for a[l+1] instead. Shouldn’t this have the same effect of allowing the network to learn an identity function, but only consuming one layer to do it instead of two?
Thanks,
Jason
Hi @TheSuperLemming
The a[l] term inside the activation for a[l+2] helps prevent overfitting, and it prevents the training error of a very deep network from increasing again after it has decreased, like this photo.
For the next question, why we can’t feed a[l] into the activation for a[l+1] instead: that would just behave like a normal deep NN, like this photo,
and you would not benefit from the advantage of the residual block, which is exactly that it prevents overfitting and keeps the training error of a very deep network from increasing again after it has decreased.
Thanks,
Abdelrahman
Hi @AbdElRhaman_Fakhry, thanks again for the response.
Perhaps I should clarify what I mean about the skip connection, as I must still be missing something.
In a normal NN we would compute:
a[l+1] = g ( W[l+1] . a[l] + b[l+1]) = g ( Z[l+1] )
I’m wondering why we can’t define the residual block as:
a[l+1] = g ( W[l+1] . a[l] + b[l+1] + a[l] ) = g ( Z[l+1] + a[l] )
which in the case that Z[l+1] → 0, we are left with:
a[l+1] = g ( a[l] ) = a[l]
And so we’ve only made our [l+1] layer redundant.
If we use the skipped connection, we instead have:
a[l+2] = g ( W[l+2] . a[l+1] + b[l+2] + a[l] ) = g ( Z[l+2] + a[l] )
which again when Z[l+2] → 0 leaves us with a[l], only now we’ve taken two layers to get there instead of just the one, so why bother with the extra layer?
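To make the comparison concrete, here are the two variants I’m contrasting, as a small numpy sketch (ReLU and matching dimensions assumed, the names are mine):

```python
import numpy as np

relu = lambda z: np.maximum(0, z)

def one_layer_skip(a_l, W1, b1):
    # My proposal: a[l+1] = g(Z[l+1] + a[l])
    return relu(W1 @ a_l + b1 + a_l)

def two_layer_skip(a_l, W1, b1, W2, b2):
    # Standard residual block: a[l+2] = g(Z[l+2] + a[l])
    a_l1 = relu(W1 @ a_l + b1)
    return relu(W2 @ a_l1 + b2 + a_l)
```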
Hi @TheSuperLemming
In the normal residual block, the idea is that we benefit from one layer ( a[l+1] = g ( W[l+1] . a[l] + b[l+1] ) ) and skip the other ( a[l+2] = g ( W[l+2] . a[l+1] + b[l+2] + a[l] ) = g ( Z[l+2] + a[l] ) ) in the case that Z[l+2] → 0. So at worst we still benefit from one layer out of every two, and if Z[l+2] != 0 (is large), the extra a[l] term does no harm and we hope it helps accuracy. (In the assignment we build a residual network with a skip over every 3 layers, so we surely benefit from 2 layers and skip, or benefit less from, the third; we use the skip just to avoid falling into overfitting or the cost increasing again after it has decreased.)
BUT
if we do what you suggested, we will not benefit from any layer in this block, so it isn’t useful: it takes more time and computation with no benefit, or only a very small benefit if Z[l+1] != 0. And if Z[l+1] is large, it means we have made no real difference from a normal deep NN {we just added a new term, a[l], inside g ( W[l+1] . a[l] + b[l+1] + a[l] ) = g ( Z[l+1] + a[l] ) }, and we may still fall into overfitting, or the cost may increase again after it had decreased, like photo 3.
[photo 3: training cost increasing again after it had decreased]
please feel free to ask any questions,
Thanks,
Abdelrahman
I think I see now, thank you for explaining!