ResNets and network performance

jjbarnes · April 28, 2026, 5:54pm

In the video lesson entitled ResNets, Andrew explains why ResNets and skip connections do not harm the performance of neural networks that use ReLU activation functions.

However, he explains that when the weights and the parameter b of the skipped layers are zero, the ReLU activation function acts as the identity, so the output of the activation function at layer L+2 is the same as at layer L, and the performance of the network is not harmed. But if the weights and the parameter b are non-zero, then the output of the ReLU activation function at layer L+2 will still be greater than or equal to 0 but could be a very different value from that of layer L, and so could harm the performance of the network - correct?

omarWael · April 28, 2026, 9:38pm

hi @jjbarnes

you are right that the values will change when the weights are non-zero, but that is actually the goal

the resnet learns the change F(x) on top of the input x

H(x) = F(x) + x

even if F(x) is not zero it does not harm the network because the skip connection provides a direct path for the gradient to flow during training

this makes it much easier to optimize deeper layers compared to a plain network where the gradient might disappear

hope this simplifies it for you

jjbarnes · April 29, 2026, 7:53am

Hello @omarWael

Also Andrew doesn’t mention the functions F(x) and H(x) so what are these functions and what is x?

But the gradient might flow in the wrong direction.

What gradient are you referring to?

What do you mean by “…flow…”?

omarWael · April 29, 2026, 3:49pm

hi @jjbarnes

Think of a residual block as a small sub-network inside the big model. x is simply the input to this block, which is the activation from layer L, or A^{[L]}. The function H(x) is the final result we want the entire block to produce, which is A^{[L+2]}. The function F(x) is what we call the residual mapping: it is what the layers on the main path, the ones the shortcut skips over, compute.

Andrew explains that instead of forcing the network to learn the full mapping H(x) from scratch, we simplify its job: we let it learn only the difference between the output and the input, so F(x) = H(x) - x. This leads us to the famous equation H(x) = F(x) + x. Even if the weights are non-zero, the network is just learning small edits or refinements to the identity x rather than trying to build the whole representation from nothing.
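
In the course's notation, the block can be sketched as below. This assumes the two-layer block from the lecture with g being ReLU. F and H are the symbols used in this thread, not in the lecture, and strictly speaking the ReLU is applied after the addition.

```latex
\begin{aligned}
z^{[l+1]} &= W^{[l+1]} a^{[l]} + b^{[l+1]}, \qquad a^{[l+1]} = g(z^{[l+1]}) \\
z^{[l+2]} &= W^{[l+2]} a^{[l+1]} + b^{[l+2]} \\
a^{[l+2]} &= g(z^{[l+2]} + a^{[l]}) \\
x &= a^{[l]}, \qquad F(x) = z^{[l+2]}, \qquad H(x) = F(x) + x = z^{[l+2]} + a^{[l]}
\end{aligned}
```

If W^{[l+2]} = 0 and b^{[l+2]} = 0, then z^{[l+2]} = 0 and a^{[l+2]} = g(a^{[l]}) = a^{[l]}, because a^{[l]} is already a ReLU output and so is non-negative.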

To understand flow, you have to imagine the training process. When we calculate the error at the end of the network, we send a signal back to the first layers to tell them how to change their weights. This signal is the gradient. In a very deep plain network the gradient must pass through many layers of multiplications. If the weights are small, the signal shrinks until it is practically zero and the early layers stop learning, which is the vanishing gradient problem.

Now look at the math of a ResNet during backpropagation. When we take the derivative of the output H(x) with respect to the input x, the sum rule in calculus gives us:

\frac{\partial H}{\partial x} = \frac{\partial F}{\partial x} + 1

That +1 is the key. It acts like a high-speed lane or a highway that allows the gradient signal to flow directly to earlier layers, even if the main path \frac{\partial F}{\partial x} is struggling or has very small weights.
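
A minimal numeric sketch of the effect of that +1, under heavy simplifying assumptions: every layer is a single scalar multiplication by the same made-up weight w, all values are positive so ReLU behaves as the identity, and the depth of 20 is arbitrary. The function names are hypothetical, chosen only for this illustration.

```python
def plain_layer_grad(w):
    # Plain layer: a_out = w * a_in (a_in > 0, so the ReLU is the identity here)
    return w  # d a_out / d a_in

def residual_block_grad(w):
    # Residual block: a_out = F(a_in) + a_in with F(a_in) = w * a_in
    return w + 1.0  # dF/dx + 1

def end_to_end_grad(local_grad, w, depth):
    # Chain rule: the derivative of the last activation with respect to
    # the first one is the product of the local derivatives of each layer.
    grad = 1.0
    for _ in range(depth):
        grad *= local_grad(w)
    return grad

w, depth = 0.01, 20  # made-up small weight and depth

plain = end_to_end_grad(plain_layer_grad, w, depth)
residual = end_to_end_grad(residual_block_grad, w, depth)

print(f"plain network gradient factor:    {plain:.3e}")
print(f"residual network gradient factor: {residual:.3e}")
```

With a small w the plain product collapses towards zero, while the residual product stays close to 1. The sketch only shows that the +1 prevents the factor from vanishing; it does not say anything about how large \frac{\partial F}{\partial x} is in a real network.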

The gradient cannot flow in the wrong direction, because gradient descent moves each weight along the negative gradient, which is by definition the direction that locally reduces the loss. When the weights are non-zero, the network is not harming the performance; it is exploring a more complex function F(x) while having the safety net of the identity x always available. If the non-zero weights were making things worse, the optimizer would tend to push them back toward zero during training, because an easy way to get good performance is already provided by the skip connection.

jjbarnes · April 29, 2026, 5:43pm

What do you mean by a “…signal…”? Andrew doesn’t mention anything about a signal.

jjbarnes · April 29, 2026, 5:45pm

Please adhere to the terminology used in the course to avoid ambiguity and confusion.

omarWael · April 29, 2026, 7:28pm

i sincerely apologize for any confusion caused by my choice of words and i really appreciate your commitment to the course terminology as you are absolutely right to point that out

my goal was just to share what i have learned regarding the engineering perspective behind these layers, but to stay strictly within the course framework, please read “signal” as the gradients computed during backpropagation

as andrew explains this process is what allows the network to update its weights and the skip connection ensures this happens effectively even in very deep models

i will also reach out to other mentors to join our discussion and provide further insights to ensure everything is perfectly clear according to the specialization syllabus

thank you for your patience and for keeping the discussion focused on the lesson materials

jjbarnes · April 29, 2026, 8:51pm

Thank you for your support in helping to make this clearer for me.

jjbarnes · April 30, 2026, 8:33am

Please try to explain more simply and intuitively why skip connections in ResNets don’t harm the training of the model.

rmwkwok · April 30, 2026, 9:51am

Hello, @jjbarnes,

I think we can use this slide from Andrew’s lecture titled “Why Resnets Work?”. In particular, around 4:13, he said " … it doesn’t really hurt your neural network … " which makes it relevant to this discussion.

We say the two layers (between a^{[l]} and a^{[l+2]}) do not hurt because skip-connection makes it easy for the network to learn that a^{[l+2]} = a^{[l]}. With the skip-connection, this only requires all the weights and biases of the additional layers be zero (as denoted by Andrew in the bottom green line). Without the skip-connection, however, this could be very difficult to learn.

In other words, without the skip-connection, a^{[l+2]} may be better or worse than a^{[l]} in terms of being a good representation of the data. However, with the skip-connection, a^{[l+2]} has the option of at least being as good as a^{[l]}, so the two additional layers won’t hurt.
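
As a rough illustration of this point, here is a small NumPy sketch of the two added layers with all of their weights and biases set to zero, once with the skip-connection and once without. The input values and the helper name `two_layers` are made up for the example, and fully connected layers are used instead of convolutions to keep it short.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def two_layers(a_l, W1, b1, W2, b2, skip):
    """Compute a^{[l+2]} from a^{[l]} through two layers, with or without the skip-connection."""
    a_l1 = relu(W1 @ a_l + b1)   # a^{[l+1]} = g(z^{[l+1]})
    z_l2 = W2 @ a_l1 + b2        # z^{[l+2]}
    if skip:
        z_l2 = z_l2 + a_l        # the skip-connection adds a^{[l]} before the ReLU
    return relu(z_l2)            # a^{[l+2]}

# a^{[l]} is itself a ReLU output, so its entries are >= 0 (made-up values)
a_l = np.array([0.5, 0.0, 2.0])

n = a_l.size
W_zero, b_zero = np.zeros((n, n)), np.zeros(n)

with_skip = two_layers(a_l, W_zero, b_zero, W_zero, b_zero, skip=True)
without_skip = two_layers(a_l, W_zero, b_zero, W_zero, b_zero, skip=False)

print("a[l]:              ", a_l)
print("a[l+2] with skip:   ", with_skip)     # same values as a[l]
print("a[l+2] without skip:", without_skip)  # all zeros
```

With the skip-connection, zero weights reproduce a^{[l]} exactly, so the two extra layers cost nothing. Without it, the same zero weights wipe out the representation, and recovering a^{[l]} would require the layers to learn an identity mapping through their weights.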

Cheers,
Raymond

PS1: I tried to relate this slide back to @omarWael's equation H(x) = F(x) + x by writing them down with a bold red pen.

PS2: I encourage you to review the lecture again to make sure you understand everything written by Andrew on the slide.

jjbarnes · April 30, 2026, 11:35am

I still don’t understand. Try explaining it mathematically using a very simple CNN and just one or two input training examples.

omarWael · April 30, 2026, 12:58pm

Thank you very much @rmwkwok This is a very interesting explanation.

And it is much simpler than my explanation; your simplification made the equation that I wanted to explain much clearer.

jjbarnes · April 30, 2026, 2:58pm

I still don’t understand.

jjbarnes · May 1, 2026, 4:09pm

Can you please re-write this explanation using the terminology and nomenclature used in the DLS course 4?

rmwkwok · May 2, 2026, 2:27am

My two cents.

As an exercise for myself, I have thoroughly read @omarWael’s great explanation again and tried to draw it (which happened to be fun because it looks like a pokemon ball)

The two circles are neural network nodes. The equation is meant to capture Omar’s idea that the 1 rescues \frac{\partial{L}}{\partial{w}} in the case of small \frac{\partial{F}}{\partial{a^{[l]}}}, with a slight change of using a^{[l]} instead of the x used by Omar in his equation.
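
Written out in LaTeX, one possible form of this idea is given below. It is a sketch of the idea, not a transcription of the drawing: it treats the quantities as scalars, writes a^{[l+2]} = F(a^{[l]}) + a^{[l]} while ignoring the final ReLU for simplicity, and lets w stand for any weight in a layer before a^{[l]}.

```latex
\frac{\partial L}{\partial w}
  = \frac{\partial L}{\partial a^{[l+2]}}
    \left( \frac{\partial F}{\partial a^{[l]}} + 1 \right)
    \frac{\partial a^{[l]}}{\partial w}
```

If \frac{\partial F}{\partial a^{[l]}} is close to zero, the bracket is still close to 1, so \frac{\partial L}{\partial w} is not forced towards zero by the added layers.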

To think of this skip-connection idea in another way: based on what I have learned about “overfitting” and “vanishing gradient” in previous lessons, I know that indefinitely appending new layers is not going to work. How, then, can I append layers without fearing such a performance drop? I make it possible for the added layers to be shut down: those layers’ weights can be zero but, thanks to the skip-connection, subsequent layers (including the output layer) won’t just get zeros. In this sense, I choose to append layers, but the training process can shut them down.

Cheers,
Raymond

PS: for those who don’t know what a Poké Ball is:

[Image: a Poké Ball. Source: PNGWing]

jjbarnes · May 2, 2026, 10:34am

Please can you re-write using LaTeX for the math instead of the handwritten math?

omarWael · May 2, 2026, 11:40am

i really love the pokemon ball idea. it is a great way to show how resnets work, and i am sure students will remember this visual for a long time. who knew learning deep learning could be this fun

i also agree with your technical points. the way you explained how the skip-connection saves the signal and prevents the gradient from vanishing is exactly right

it is a pleasure to see these hard topics made easier for the community. thank you for taking the time to improve the explanation, @rmwkwok

best regards
omar

jjbarnes · May 2, 2026, 2:48pm

Pokémon is not helping me understand skip connections in resnets and isn’t included in the course.

jjbarnes · May 2, 2026, 4:07pm

What does “…residual mapping…” mean? Andrew doesn’t use this terminology in his course.

Please re-write this post using only the terminology and nomenclature used in the course to avoid ambiguity and confusion.

Deepti_Prasad · May 2, 2026, 6:38pm

@jjbarnes

Your understanding is correct that non-zero weights and biases give a different, non-zero activation output at the later layer. However, ResNets are typically trained with an optimiser such as SGD together with weight decay (L2 regularisation), which tends to shrink the weights of layers that are not helping towards zero. Because each residual block outputs the result of its layers plus the original (previous) value, the block can learn non-linear patterns and still fall back to its starting point, the input.

In other words, a ResNet maintains model performance by adding each block's original input to the output of its layers as the data passes through a deeper network.
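
One common mechanism by which training pushes weights towards zero is weight decay (L2 regularisation) added to the SGD update. The sketch below assumes, purely for illustration, that the loss gradient for the added layer is exactly zero (the layer is not needed), and uses made-up, exaggerated values for the learning rate and the weight decay coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))   # made-up starting weights of an added layer
lr, weight_decay = 0.1, 0.5   # hypothetical hyperparameters, exaggerated for the demo

norm_before = np.linalg.norm(W)

for _ in range(50):
    data_grad = np.zeros_like(W)              # assumption: the loss does not depend on W
    W -= lr * (data_grad + weight_decay * W)  # SGD step with an L2 penalty term

norm_after = np.linalg.norm(W)

print(f"weight norm before: {norm_before:.4f}")
print(f"weight norm after:  {norm_after:.4f}")
```

Each step multiplies W by (1 - lr * weight_decay) = 0.95, so the weights shrink geometrically. In a real network the loss gradient is not zero, so weights that help reduce the loss are held away from zero while unhelpful ones decay.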

Andrew didn't explain the case of non-zero values in the video, so I cannot give you a reference from the video you are referring to, but if you refer to the original ResNet model architecture, you will see how the identity block (output + input) maintains the model performance in a ResNet.
