I understand that the purpose of the residual block is to mitigate the vanishing gradient problem, with the output of the residual block given by:
a[l+2] = g ( W[l+2] . a[l+1] + b[l+2] + a[l] )
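For concreteness, here is a minimal numpy sketch of how I picture that forward pass (the ReLU choice, the variable names, and the assumption that a[l] has the same dimension as a[l+2] are just my own choices for illustration):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(a_l, W1, b1, W2, b2):
    """Two-layer residual block:
    a[l+1] = g(W[l+1] . a[l] + b[l+1])
    a[l+2] = g(W[l+2] . a[l+1] + b[l+2] + a[l])
    """
    a_l1 = relu(W1 @ a_l + b1)
    z_l2 = W2 @ a_l1 + b2
    return relu(z_l2 + a_l)  # skip connection adds a[l] before the final activation
```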
Firstly, in the case that the gradient goes to zero, doesn't this only affect dW and db? So rather than W, b → 0, wouldn't they instead just remain constant? Surely for W, b → 0 to happen, it would have to have been learned, and would therefore suggest optimality?
Secondly, why do we need the skipped connection? If instead we had:
a[l+1] = g ( W[l+1] . a[l] + b[l+1] + a[l] )
then in the case that W, b → 0 we get:
a[l+1] = g ( a[l] ) = a[l]
and we only have one redundant layer rather than two. I appreciate that we could write the above as:
a[l+1] = g ( W’[l+1] . a[l] + b[l+1] )
where W’ = W + I (I being the identity matrix), which reduces to the standard forward propagation step. But explicitly adding the a[l] term dissociates it from any learned parameters, so that it can’t vanish in the same way?
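(A quick numerical check of that equivalence, purely illustrative, with random values and ReLU assumed:)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
W = rng.standard_normal((n, n))
b = rng.standard_normal(n)
a = rng.standard_normal(n)

relu = lambda z: np.maximum(0, z)

# g(W . a + b + a) is the same as g((W + I) . a + b)
print(np.allclose(relu(W @ a + b + a), relu((W + np.eye(n)) @ a + b)))  # True
```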
Hi @TheSuperLemming
Here W, b wouldn’t actually go to zero; they shrink depending on the regularization parameter lambda and on what gradient descent finds to give the best g ( W[l+1] . a[l] + b[l+1] + a[l] ), which is the input to the next layer. So even if we suppose W, b will be small numbers, they still have an effect on the output of this layer, and gradient descent tunes how much effect W, b have so as to reach the optimization target with a small error without falling into overfitting. So we can’t neglect them and rely on W, b being zero.
please feel free to ask any questions,
Thanks,
Abdelrahman
Hi @AbdElRhaman_Fakhry
Thanks for the response, though I’m still not sure I understand.
Do you mean that adding the a[l] term inside the activation for a[l+2] prevents over-dependence on W and b?
And I’m still not sure why we can’t feed a[l] into the activation for a[l+1] instead. Shouldn’t this have the same effect of allowing the network to learn an identity function, but only consuming one layer to do it instead of two?
Thanks,
Jason
Hi @TheSuperLemming
The a[l] term inside the activation for a[l+2] helps prevent overfitting, and it prevents the training error of a very deep network from increasing again after it has decreased, like this photo.
For the next question, why we can’t feed a[l] into the activation for a[l+1] instead: that would just behave like a normal deep NN, like this photo,
and you would not benefit from the advantage of the residual block, which is exactly that it prevents overfitting and keeps the training error of a very deep network from increasing again after it has decreased.
Thanks,
Abdelrahman
Hi @AbdElRhaman_Fakhry, thanks again for the response.
Perhaps I should clarify what I mean about the skip connection, as I must still be missing something.
In a normal NN we would compute:
a[l+1] = g ( W[l+1] . a[l] + b[l+1]) = g ( Z[l+1] )
I’m wondering why we can’t define the residual block as:
a[l+1] = g ( W[l+1] . a[l] + b[l+1] + a[l] ) = g ( Z[l+1] + a[l] )
which in the case that Z[l+1] → 0, we are left with:
a[l+1] = g ( a[l] ) = a[l]
And so we’ve only made our [l+1] layer redundant.
If we use the skipped connection, we instead have:
a[l+2] = g ( W[l+2] . a[l+1] + b[l+2] + a[l] ) = g ( Z[l+2] + a[l] )
which again when Z[l+2] → 0 leaves us with a[l], only now we’ve taken two layers to get there instead of just the one, so why bother with the extra layer?
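To make the comparison concrete, here are the two variants I’m contrasting, as a small numpy sketch (ReLU and matching dimensions assumed, the names are mine):

```python
import numpy as np

relu = lambda z: np.maximum(0, z)

def one_layer_skip(a_l, W1, b1):
    # My proposal: a[l+1] = g(Z[l+1] + a[l])
    return relu(W1 @ a_l + b1 + a_l)

def two_layer_skip(a_l, W1, b1, W2, b2):
    # Standard residual block: a[l+2] = g(Z[l+2] + a[l])
    a_l1 = relu(W1 @ a_l + b1)
    return relu(W2 @ a_l1 + b2 + a_l)
```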
Hi @TheSuperLemming
In the normal residual block, the idea is that we benefit from one layer ( a[l+1] = g ( W[l+1] . a[l] + b[l+1] ) ) and skip the other ( a[l+2] = g ( W[l+2] . a[l+1] + b[l+2] + a[l] ) = g ( Z[l+2] + a[l] ) ) in the case that Z[l+2] → 0. So at worst we still benefit from one layer out of every two, and if Z[l+2] != 0 (is large), the extra a[l] term does no harm and we hope it helps accuracy. (In the assignment we build a residual network with a skip over every 3 layers, so we surely benefit from 2 layers and skip, or benefit less from, the third; we use the skip just to avoid falling into overfitting or the cost increasing again after it has decreased.)
BUT
if we do what you suggested, we will not benefit from any layer in this block, so it isn’t useful: it takes more time and computation with no benefit, or only a very small benefit if Z[l+1] != 0. And if Z[l+1] is large, it means we have made no real difference from a normal deep NN {we just added a new term, a[l], inside g ( W[l+1] . a[l] + b[l+1] + a[l] ) = g ( Z[l+1] + a[l] ) }, and we may still fall into overfitting, or the cost may increase again after it had decreased, like photo 3.
[photo 3: training cost increasing again after it had decreased]
please feel free to ask any questions,
Thanks,
Abdelrahman
I think I see now, thank you for explaining!