In the video lesson entitled ResNets, Andrew explains why ResNets and skip connections do not harm the performance of neural networks that use ReLU activation functions.
However, he explains this for the case where the weights and the parameter b of the skipped layers are zero: the block then acts as the identity function, so the activation output of layer L+2 is the same as that of layer L and the performance of the network is not harmed. But if the weights and the parameter b are non-zero, then the output of the ReLU activation function at layer L+2 will still be non-negative but could be a very different value from that of layer L, and so could harm the performance of the network. Is that correct?
hi @jjbarnes
You are right that the values will change when the weights are non-zero, but that is actually the goal.
The ResNet learns the change F(x) on top of the input x:
H(x) = F(x) + x
Even if F(x) is not zero, it does not harm the network, because the skip connection provides a direct path for the gradient to flow during training.
This makes it much easier to optimize deeper layers compared to a plain network, where the gradient might vanish.
Hope this simplifies it for you.
Hello @omarWael
Also Andrew doesn’t mention the functions F(x) and H(x) so what are these functions and what is x?
But the gradient might flow in the wrong direction.
What gradient are you referring to?
What do you mean by “…flow…”?
hi @jjbarnes
Think of a residual block as a small sub-network inside the big model. x is simply the input to this block, which is the activation from layer L, or A^{[L]}. The function H(x) is the final result we want the entire block to produce, which is A^{[L+2]}. The function F(x) is what we call the residual mapping; it represents the layers on the main path that the shortcut skips over.
Andrew explains that instead of forcing the network to learn the full mapping H(x) from scratch, we simplify its job: we let it learn only the difference between the input and the output, so F(x) = H(x) - x. This leads us to the famous equation H(x) = F(x) + x. Even if the weights are non-zero, the network is just learning small edits or refinements to the identity x rather than trying to build the whole representation from nothing.
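To make the relation F(x) = H(x) - x concrete, here is a minimal NumPy sketch. It is only an illustration under assumptions that are not from the lecture: the main path is two small fully connected layers (not convolutions), the final ReLU of the block is left out to match the simplified equation, and the shapes, seed and weight scale are made up.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def F(x, W1, b1, W2, b2):
    """Main path of the block: two small layers (hypothetical shapes)."""
    return W2 @ relu(W1 @ x + b1) + b2

def H(x, W1, b1, W2, b2):
    """Block output: main path plus the skip connection."""
    return F(x, W1, b1, W2, b2) + x

rng = np.random.default_rng(0)
x = relu(rng.normal(size=4))          # stands in for A^{[L]}, so it is >= 0
W1 = 0.01 * rng.normal(size=(4, 4))   # small non-zero weights
W2 = 0.01 * rng.normal(size=(4, 4))
b1 = np.zeros(4)
b2 = np.zeros(4)

out = H(x, W1, b1, W2, b2)
residual = out - x                    # this is F(x) = H(x) - x

print("x        :", x)
print("H(x)     :", out)
print("H(x) - x :", residual)
```

With small non-zero weights the printed difference H(x) - x is a small correction on top of x, which is the "small edits to the identity" picture described above.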
To understand "flow", you have to imagine the training process. When we calculate the error at the end of the network, we send a signal back to the first layers to tell them how to change their weights. This signal is the gradient. In a very deep plain network, the gradient must pass through many layers of multiplications. If the weights are small, the signal shrinks towards zero and the early layers stop learning, which is the vanishing gradient problem.
Now look at the math of a ResNet during backpropagation. When we take the derivative of the output H(x) with respect to the input x, the sum rule in calculus gives us:
\frac{\partial H}{\partial x} = \frac{\partial F}{\partial x} + 1
That +1 is the key. It acts like a high-speed lane or a highway that allows the gradient signal to flow directly to earlier layers, even if the main path \frac{\partial F}{\partial x} is struggling or has very small weights.
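The effect of the +1 can be checked numerically. The sketch below is a toy scalar example, not anything from the course: F, the weight values 0.01, the input 2.0 and the depth of 50 are all made-up assumptions, and the derivatives are estimated with finite differences.

```python
def relu(z):
    return max(z, 0.0)

def F(x, w1=0.01, w2=0.01):
    """Main path with tiny (hypothetical) weights, so dF/dx is tiny."""
    return w2 * relu(w1 * x)

def H(x):
    """Residual block: main path plus the skip connection."""
    return F(x) + x

def num_grad(f, x, eps=1e-6):
    """Central finite-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x0 = 2.0
dF = num_grad(F, x0)   # derivative of the main path alone
dH = num_grad(H, x0)   # derivative of the block with the skip connection

# Backpropagating through many stacked blocks multiplies the local derivatives.
depth = 50
plain_chain = dF ** depth      # product without skip connections
residual_chain = dH ** depth   # product with skip connections

print("dF/dx:", dF, " dH/dx:", dH)
print("plain product:", plain_chain, " residual product:", residual_chain)
```

In this toy setting the product of derivatives without the skip connection collapses towards zero, while the product with the skip connection stays close to 1.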
The gradient cannot flow in the wrong direction, because each weight update steps against the gradient, which by definition is the direction that locally reduces the total loss. When the weights are non-zero, the network is not harming the performance; it is actually exploring a more complex function F(x) while having the safety net of the identity x always available. If the non-zero weights were making things worse, the optimizer would simply push them back toward zero during training, because the easiest way to get good performance is already provided by the skip connection.
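The claim that the optimizer pushes unhelpful weights back toward zero can be illustrated with a deliberately tiny example. Everything here is an assumption made for illustration: a block with a single scalar weight w and a linear main path, so H(x) = w * x + x, two made-up training examples whose targets equal their inputs, plain gradient descent, and an arbitrary learning rate.

```python
# Toy residual block with one scalar weight: H(x) = w * x + x.
# The targets equal the inputs, so the best block is the identity (w = 0).
xs = [1.0, 2.0]          # two made-up training inputs
ys = [1.0, 2.0]          # targets equal to the inputs

def loss(w):
    """Mean squared error of the block output against the targets."""
    return sum((w * x + x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad(w):
    """Derivative of the mean squared error with respect to w."""
    return sum(2 * (w * x + x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.8                  # start with a non-zero weight that hurts the fit
lr = 0.1                 # learning rate (arbitrary choice)
history = [loss(w)]
for _ in range(20):
    w -= lr * grad(w)    # step against the gradient
    history.append(loss(w))

print("final w:", w)
print("first loss:", history[0], " last loss:", history[-1])
```

Here the loss falls at every step and w shrinks toward zero, so the block drifts back to the identity provided by the skip connection. A real ResNet has many weights and a non-convex loss, so this shows only the mechanism, not a guarantee.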
What do you mean by a “…signal…”? Andrew doesn’t mention anything about a signal.
Please adhere to the terminology used in the course to avoid ambiguity and confusion.
I sincerely apologize for any confusion caused by my choice of words, and I really appreciate your commitment to the course terminology. You are absolutely right to point that out.
My goal was just to share what I have learned regarding the engineering perspective behind these layers, but to stay strictly within the course framework, please consider:
signal = the gradients passed backwards during backpropagation
As Andrew explains, this process is what allows the network to update its weights, and the skip connection ensures this happens effectively even in very deep models.
I will also reach out to other mentors to join our discussion and provide further insights, to ensure everything is perfectly clear according to the specialization syllabus.
Thank you for your patience and for keeping the discussion focused on the lesson materials.
Thank you for your support in helping to make this clearer for me.
Please try to explain more simply and intuitively why skip connections in ResNets don’t harm training of the model.
Hello, @jjbarnes,
I think we can use this slide from Andrew’s lecture titled “Why Resnets Work?”. In particular, around 4:13, he said " … it doesn’t really hurt your neural network … " which makes it relevant to this discussion.
We say the two layers (between a^{[l]} and a^{[l+2]}) do not hurt because skip-connection makes it easy for the network to learn that a^{[l+2]} = a^{[l]}. With the skip-connection, this only requires all the weights and biases of the additional layers be zero (as denoted by Andrew in the bottom green line). Without the skip-connection, however, this could be very difficult to learn.
In other words, without the skip-connection, a^{[l+2]} may be better or worse than a^{[l]} in terms of being a good representation of the data. However, with the skip-connection, a^{[l+2]} has the option of at least being as good as a^{[l]}, so the two additional layers won’t hurt.
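The zero-weight case can be written out in the course notation with a few lines of NumPy. This is only a sketch: the two example columns in `a_l` are made-up numbers, the layers are fully connected rather than convolutional, and the variable names are my own.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Two made-up examples as columns; a^{[l]} is already a ReLU output, so it is >= 0.
a_l = np.array([[0.5, 2.0],
                [1.5, 0.0],
                [0.0, 3.0]])

n = a_l.shape[0]
W1, b1 = np.zeros((n, n)), np.zeros((n, 1))   # layer l+1: weights and bias set to zero
W2, b2 = np.zeros((n, n)), np.zeros((n, 1))   # layer l+2: weights and bias set to zero

a_l1 = relu(W1 @ a_l + b1)                    # a^{[l+1]}
z_l2 = W2 @ a_l1 + b2                         # z^{[l+2]}

a_l2_plain = relu(z_l2)                       # without the skip connection
a_l2_skip = relu(z_l2 + a_l)                  # with it: g(z^{[l+2]} + a^{[l]})

print("plain:\n", a_l2_plain)
print("skip:\n", a_l2_skip)
```

With all weights and biases at zero, the plain version outputs zeros and loses the information in a^{[l]}, while the version with the skip connection returns a^{[l]} unchanged, because ReLU leaves non-negative values as they are.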
Cheers,
Raymond
PS1: I tried to relate this slide back to @omarWael 's equation H(x) = F(x) + x by writing them down with a bold red pen.
PS2: I encourage you to review the lecture again to make sure you understand everything written by Andrew on the slide.
I still don’t understand. Try explaining it mathematically using a very simple CNN and just one or two input training examples.
Thank you very much @rmwkwok This is a very interesting explanation.
And it is much simpler than my explanation; your simplification makes the equation I wanted to explain much easier to follow.
I still don’t understand.
Can you please re-write this explanation using the terminology and nomenclature used in the DLS course 4?
My two cents.
As an exercise for myself, I have thoroughly read @omarWael’s great explanation again and tried to draw it (which happened to be fun because it looks like a pokemon ball)
The two circles are neural network nodes. The equation is meant to capture Omar’s idea that the 1 rescues \frac{\partial{L}}{\partial{w}} in the case of small \frac{\partial{F}}{\partial{a^{[l]}}}, with a slight change of using a^{[l]} instead of the x used by Omar in his equation.
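Written out as a chain rule, the idea reads roughly as follows. This is a sketch under assumptions, not a transcription of the drawing: it uses Omar's simplified scalar form a^{[l+2]} = F(a^{[l]}) + a^{[l]} (ignoring the final ReLU of the block), L is the loss, and w is a weight in some layer before the block that affects L only through a^{[l]}.

\frac{\partial a^{[l+2]}}{\partial a^{[l]}} = \frac{\partial F}{\partial a^{[l]}} + 1

\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a^{[l+2]}} \left( \frac{\partial F}{\partial a^{[l]}} + 1 \right) \frac{\partial a^{[l]}}{\partial w}

When \frac{\partial F}{\partial a^{[l]}} is close to zero, the bracket is close to 1 instead of 0, so \frac{\partial L}{\partial w} \approx \frac{\partial L}{\partial a^{[l+2]}} \frac{\partial a^{[l]}}{\partial w} and the block does not make the gradient vanish.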
Here is another way I think about the skip-connection idea. Based on what I have learned about “overfitting” and “vanishing gradient” in previous lessons, I know that indefinitely appending new layers is not going to work. So how can I append layers without fearing such a performance drop? I make it possible for the added layers to be shut down: those layers’ weights can be zero, but thanks to the skip-connection, subsequent layers (including the output layer) won’t just get zeros. In this sense, I choose to append layers, but the training process can shut them down.
Cheers,
Raymond
PS: for those who don’t know what a pokemon ball is:
[Image: a Poké Ball. Source: PNGWing]
Please can you re-write using LaTeX for Math instead of the handwritten math?
I really love the pokemon ball idea.
It is a great way to show how ResNets work, and I am sure students will remember this visual for a long time. Who knew learning deep learning could be this fun?
I also agree with your technical points. The way you explained how the skip-connection saves the signal and prevents the gradient from vanishing is exactly right.
It is a pleasure to see these hard topics made easier for the community. Thank you for taking the time to improve the explanation, @rmwkwok
best regards
omar
Pokémon is not helping me understand skip connections in resnets and isn’t included in the course.
What does “…residual mapping…” mean? Andrew doesn’t use this terminology in his course.
Please re-write this post using only the terminology and nomenclature used in the course to avoid ambiguity and confusion.
@jjbarnes
Your understanding is correct that non-zero weights and parameters do give a non-zero activation output in the subsequent layer. But ResNets are typically trained with an SGD optimiser with weight decay, which pushes the weights of those layers towards zero when they are not helping. This makes sure the output of each residual block is the output of its layers plus the original (previous) value, so the block can learn non-linear patterns of the network while still being able to go back to its starting point.
I.e., a ResNet basically maintains model performance by adding the original input to the output of each residual block as it passes through a deeper neural network.
Andrew didn't explain the non-zero case in the video, so I cannot give you a reference from the video you are referring to. But if you refer to the original ResNet model architecture, you will understand how the identity block (output + input) maintains model performance in a ResNet.