Convolution Confusion (ResNets) C4W2

So, consider that I am mid-lecture (perhaps this becomes clearer in the assignment), but it is simply not clear what is going on here (my questions/notes are in red):

So,

  1. Are we still using the activation from the prior layer, a^[l], carrying out the standard weight and activation operations on all the layers ‘in between’, and then re-applying that activation to a later layer via an add?

1a) OK, assume we are adding that activation to a later layer: does that mean we then do not apply the usual activation to that layer, or do we still apply it?

  2. If this is an ‘add’, is it just a straight matrix addition, or something else?

  3. Given that Prof. Ng displays it as a ‘big NN’ before this point, are we still doing this on the conv layers, or only the fully connected ones? In his diagram, the drawing kind of looks like a fully connected layer.

A few questions here, but I found this topic confusing…


Yes, in this example, we are still using the activation function for the layers “in-between”. We still have activations for both the layers shown in the diagram.

The only difference is that, for the second layer, we add the residual to z[l+2] before applying the activation.

There are 2 layers in the diagram, and both layers have activations.
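
To make that concrete, here is a minimal sketch of the forward pass through such a block, assuming plain fully connected layers and ReLU activations (the names relu, W1, b1, etc. are just placeholders, not the course code):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# a_l: the activation coming out of layer l (the start of the skip connection)
# W1, b1, W2, b2: weights/biases of the two layers "in between"
def residual_block_forward(a_l, W1, b1, W2, b2):
    z1 = W1 @ a_l + b1
    a1 = relu(z1)            # the first layer still gets its normal activation
    z2 = W2 @ a1 + b2
    a2 = relu(z2 + a_l)      # add the residual to z[l+2] *before* the activation
    return a2
```

The only place the shortcut changes anything is the last step: the add happens first, and the usual activation is still applied afterwards.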

Yes, it is a straight matrix element-wise add.

This diagram is just a simple example explaining the concept of residuals. Yes, the 2 layers in the diagram look like simple fully connected layers.

However, the same concept of adding back the residuals can be applied to conv nets as well. You will need to make sure the matrix dimensions match for the addition. I believe this will be covered in future lectures on image segmentation using conv nets.
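
If it helps, here is a rough Keras sketch of the same idea for conv layers, assuming the block keeps the input shape unchanged so the add works (the filter and kernel sizes are made up for illustration, and this is not the assignment’s exact code):

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_residual_block(x, filters=64):
    shortcut = x                                      # save the input for the skip connection
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Add()([x, shortcut])                   # element-wise add; shapes must match
    return layers.Activation("relu")(x)

# Example usage with a 56x56 feature map that already has 64 channels
inputs = layers.Input(shape=(56, 56, 64))
outputs = conv_residual_block(inputs)
```

Because padding="same" is used and the filter count equals the input’s channel count, the main path produces exactly the same shape as the shortcut, so the Add() is legal.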

As an aside, this concept of adding back the residuals is found in a variety of different models, not just conv or vision/image models. For example, it’s also used in the popular Transformer architecture (the architecture behind ChatGPT).


In addition to @hackyon’s great answers, for your Q3, we can refer to this chart later in the video.

In a real ResNet, not just any residual network, we are dealing with Conv layers.

Cheers,
Raymond


Thank you for adding this additional image, Raymond, as I wondered about this too but forgot to ask.

Thus this might be my Q4.

I can get, then, what is happening with most of this, but what do the dotted lines in the jump mean?

I’m not certain it was explained…


Oh, that occurs whenever there is a dimension increase. For example, the following one goes from 64D to 128D. Check the full chart in the video for other dimension changes.

[image: excerpt of the ResNet chart showing the dashed skip connection where the dimension goes from 64 to 128]

The dashed line indicates that the skip connection there is implemented differently, to take care of the dimension difference.

Note that two matrices need to share the same dimensions for us to be able to add them up.
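
In code, that “implemented differently” is commonly a projection shortcut: a 1x1 convolution (often with a stride) that reshapes the input so the two tensors can be added. A rough sketch, with illustrative filter counts matching the 64-to-128 jump above (not the assignment’s exact code):

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block_with_projection(x, filters=128, stride=2):
    # Main path (abbreviated): the spatial size and channel count change here
    main = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    main = layers.BatchNormalization()(main)

    # Projection shortcut (the dashed line): a 1x1 conv maps the input from,
    # e.g., 64 channels to 128 channels (and down-samples) so the Add() is possible
    shortcut = layers.Conv2D(filters, 1, strides=stride, padding="valid")(x)
    shortcut = layers.BatchNormalization()(shortcut)

    return layers.Activation("relu")(layers.Add()([main, shortcut]))
```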

Cheers,
Raymond


Oh okay @rmwkwok, thanks. He does discuss this procedure in the lecture, yet unless you really zoom in on the video (as you have done) it is otherwise a little hard to see/understand what is going on here.


Great, Anthony @Nevermnd. I didn’t check the video. I suppose he also explained how to make the dimensions the same?

@rmwkwok Yes he did. And thanks also for the answers @hackyon.

– However, I am still a little confused here.

So I was able to complete the ResNet programming assignment just fine, but even though we have been talking about it all along so far: are these types of networks not doing back-prop (only forward-prop)?!?

I mean, for one, in the written TF code I don’t obviously see where a back-prop step might be occurring.

And though we obviously wish to increase our accuracy, there is no more talk of a ‘loss function’. Technically, accuracy is one in a way, but I don’t see where in the TF code we are specifying it as our desired metric. [I’m going to keep this question here for others, but I went back and looked more closely, and I see this does come up in the course-provided code block]:

Though what is this Keras Tutorial Notebook it is referring to? The one from Week 1?

Finally, is there some reason we use the glorot_uniform initializer for the convolutional block, but random_uniform for the identity block?

I feel like I’m still not getting the concept of the ‘identity’ block… Is it like an identity matrix on the weights, or something?

These models are perfectly capable of back-prop; we just may or may not implement it explicitly in the assignments (it may be out of scope).

The loss='categorical_crossentropy' that is specified in model.compile() is usually enough for TF to know what loss function to use. You can also use a separate loss object or function to manually compute loss and apply back-prop.
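
For concreteness, here is a minimal sketch of the compile-and-fit pattern being described (the tiny model and random data are stand-ins, not the assignment’s ResNet50):

```python
import tensorflow as tf
from tensorflow.keras import layers

# A tiny stand-in model; in the assignment this would be the ResNet50 you built
model = tf.keras.Sequential([
    layers.Input(shape=(64, 64, 3)),
    layers.Conv2D(8, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(6, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",  # tells TF which loss to minimize
              metrics=["accuracy"])             # accuracy is only a reported metric

# Random stand-in data just so the snippet runs. fit() does the forward pass,
# computes the loss, and applies back-prop and the weight updates for us.
X_train = tf.random.uniform((32, 64, 64, 3))
Y_train = tf.one_hot(tf.random.uniform((32,), maxval=6, dtype=tf.int32), depth=6)
model.fit(X_train, Y_train, epochs=2, batch_size=16)
```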

For many modern ML libraries (like TF and pytorch), the back-prop implementation is taken care of by the library. The library keeps track of all the computations you do to the input (using a computation graph), which allows them to apply back-prop for you without you having to implement it yourself.
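
And if you do want to see back-prop spelled out, here is a rough sketch of the manual route using TF’s gradient tape (names and hyperparameters are illustrative):

```python
import tensorflow as tf

loss_fn = tf.keras.losses.CategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam()

def train_step(model, x_batch, y_batch):
    with tf.GradientTape() as tape:              # records the computation graph
        y_pred = model(x_batch, training=True)   # forward pass
        loss = loss_fn(y_batch, y_pred)
    # Back-prop: the tape differentiates the loss w.r.t. every trainable weight
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```

This is essentially what model.fit() is doing for you under the hood.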

I don’t know about the Keras Tutorial Notebook either, so that sentence might just be out-of-date.

I think the high-level idea is that residual connections allow a block to act like an identity function (where the output of the block is similar to, or close in value to, the input of the block). Without residual connections, it would be difficult for a block to model an identity function.
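
One toy way to see that numerically (a sketch, not the course code): if the block’s weights are driven toward zero, the pre-activation z[l+2] is roughly zero, and ReLU of (z + input) just hands the input back, so the block collapses to the identity:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

a_l = relu(np.random.randn(4, 1))           # a previous ReLU output, so it is non-negative
W, b = np.zeros((4, 4)), np.zeros((4, 1))   # weights driven (near) zero

z = W @ a_l + b                             # ~0
print(relu(z))                              # all zeros: without the skip, the input is lost
print(relu(z + a_l))                        # equals a_l: with the skip, the block is the identity
```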


In the assignment we do make extensive use of the Conv2D Keras function. I wonder if somehow this function ‘knows’, or otherwise automatically implements, back-prop if chained along with other functions like BatchNormalization, Activation, etc.?

Yes, Conv2D from TF can automatically perform back prop if chained properly.

Yes, that’s one of the beauties of TF/Keras and other platforms like PyTorch: they completely free us from worrying about backprop. It is all handled by the platform. For “canned” functions like the activations that the platform provides, they provide the derivative functions. For everything else, they use a mathematical technique called “automatic differentiation” that is based on ideas similar to Gradient Checking, which we saw in DLS C2 W1. E.g. here’s a place to start in the TF documentation tree to understand more.

It’s all “magic” and transparent as long as we are careful and follow the rules. E.g. one very important point is that we have to be careful that all the functions in any “compute graph” that we create are TF functions. If you put a single numpy function, even something as simple as np.transpose or .T in the computation, then learning fails because the gradients are incomplete.
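
A small sketch of exactly that failure mode, comparing a TF op with its numpy counterpart inside a gradient tape (the shapes are arbitrary):

```python
import numpy as np
import tensorflow as tf

x = tf.Variable(tf.random.uniform((2, 3)))

# Correct: every op is a TF op, so the tape can back-prop through it
with tf.GradientTape() as tape:
    y = tf.reduce_sum(tf.transpose(x) ** 2)
print(tape.gradient(y, x))    # a proper gradient tensor (2 * x)

# Broken: the numpy call drops out of the computation graph
with tf.GradientTape() as tape:
    y = tf.reduce_sum(np.transpose(x.numpy()) ** 2)
print(tape.gradient(y, x))    # None -- the gradient chain is cut
```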
