Convolution Confusion (ResNets) C4W2

So, consider that I am mid-lecture (perhaps this becomes clearer in the assignment), but it is simply not clear what is going on here (my questions/notes are in red):

So,

  1. Are we still using the activation from the prior layer, a^[l], carrying out the standard weight and activation operations on all the layers ‘in between’, and then re-applying that activation to a later layer via an add?

1a) OK, assume we are adding that activation to a later layer: does that mean we then do not apply the usual activation to that layer, or do we still apply it?

  2. If this is an ‘add’, is it just a straight matrix addition, or something else?

  3. Given that Prof. Ng displays it as a ‘big NN’ before this point, are we still doing this on the conv layers, or only the fully connected ones? In his diagram, the drawing kind of looks like a fully connected layer.

A few questions here, but I found this topic confusing…


Yes, in this example, we are still using the activation function for the layers “in-between”. We still have activations for both the layers shown in the diagram.

The only difference is that, for the second layer, we add the residual to z[l+2] before applying the activation.

There are 2 layers in the diagram, and both layers have activations.
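
To make that concrete, here is a minimal sketch of the forward pass through such a block, assuming plain fully connected layers and ReLU activations (the names relu, W1, b1, etc. are just placeholders, not the course code):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# a_l: the activation coming out of layer l (the start of the skip connection)
# W1, b1, W2, b2: weights/biases of the two layers "in between"
def residual_block_forward(a_l, W1, b1, W2, b2):
    z1 = W1 @ a_l + b1
    a1 = relu(z1)            # the first layer still gets its normal activation
    z2 = W2 @ a1 + b2
    a2 = relu(z2 + a_l)      # add the residual to z[l+2] *before* the activation
    return a2
```

The only place the shortcut changes anything is the last step: the add happens first, and the usual activation is still applied afterwards.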

Yes, it is a straight matrix element-wise add.

This diagram is just a simple example explaining the concept of residuals. Yes, the 2 layers in the diagram look like simple fully connected layers.

However, the same concept of adding back the residuals can be applied to conv nets as well. You will need to make sure the matrix dimensions match for the addition. I believe this will be covered in future lectures on image segmentation using conv nets.
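
If it helps, here is a rough Keras sketch of the same idea for conv layers, assuming the block keeps the input shape unchanged so the add works (the filter and kernel sizes are made up for illustration, and this is not the assignment’s exact code):

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_residual_block(x, filters=64):
    shortcut = x                                      # save the input for the skip connection
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Add()([x, shortcut])                   # element-wise add; shapes must match
    return layers.Activation("relu")(x)

# Example usage with a 56x56 feature map that already has 64 channels
inputs = layers.Input(shape=(56, 56, 64))
outputs = conv_residual_block(inputs)
```

Because padding="same" is used and the filter count equals the input’s channel count, the main path produces exactly the same shape as the shortcut, so the Add() is legal.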

As an aside, this concept of adding back the residuals is found in a variety of different models, not just conv or vision/image models. For example, it’s also used in the popular Transformer architecture (the architecture behind ChatGPT).


In addition to @hackyon’s great answers, for your Q3, we can refer to this chart later in the video.

In a real ResNet, not just any residual network, we are dealing with Conv layers.

Cheers,
Raymond


Thank you for adding this additional image, Raymond, as I wondered about this too but forgot to ask.

Thus this might be my Q4.

I can get, then, what is happening with most of this, but what do the dotted lines in the jump mean?

I’m not certain it was explained…


Oh, that occurs whenever there is a dimension increase. For example, the following one goes from 64D to 128D. Check the full chart in the video for other dimension changes.

[image: excerpt of the ResNet chart showing the dashed skip connection where the dimension goes from 64 to 128]

The dashed line indicates that the skip connection there is implemented differently, to take care of the dimension difference.

Note that two matrices need to share the same dimensions for us to be able to add them up.
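
In code, that “implemented differently” is commonly a projection shortcut: a 1x1 convolution (often with a stride) that reshapes the input so the two tensors can be added. A rough sketch, with illustrative filter counts matching the 64-to-128 jump above (not the assignment’s exact code):

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block_with_projection(x, filters=128, stride=2):
    # Main path (abbreviated): the spatial size and channel count change here
    main = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    main = layers.BatchNormalization()(main)

    # Projection shortcut (the dashed line): a 1x1 conv maps the input from,
    # e.g., 64 channels to 128 channels (and down-samples) so the Add() is possible
    shortcut = layers.Conv2D(filters, 1, strides=stride, padding="valid")(x)
    shortcut = layers.BatchNormalization()(shortcut)

    return layers.Activation("relu")(layers.Add()([main, shortcut]))
```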

Cheers,
Raymond


Oh okay @rmwkwok, thanks. He does discuss this procedure in the lecture, yet unless you really zoom in on the video (as you have done) it is otherwise a little hard to see/understand what is going on here.


Great, Anthony @Nevermnd. I didn’t check the video. I suppose he also explained how to make the dimensions the same?

@rmwkwok Yes he did. And thanks also for the answers @hackyon.

– However, I am still a little confused here.

So I was able to complete the ResNet programming assignment just fine, but even though we have been talking about it all along so far: are these types of networks not doing back-prop (only forward-prop)?!?

I mean, for one, in the written TF code I don’t obviously see where a back-prop step might be occurring.

And though we obviously wish to increase our accuracy, there is no more talk of a ‘loss function’. Technically, accuracy is one in a way, but I don’t see where in the TF code we are specifying it as our desired metric. [I’m going to keep this question here for others, but I went back and looked more closely, and I see this does come up in the course-provided code block]:

Though what is this Keras Tutorial Notebook it is referring to? The one from Week 1?

Finally, is there some reason we use the glorot_uniform initializer for the convolutional block, but random_uniform for the identity block?

I feel like I’m still not getting the concept of the ‘identity’ block… Is it like an identity matrix on the weights, or something?

These models are perfectly capable of back-prop; we just may or may not implement it explicitly in the assignments (it may be out of scope).

The loss='categorical_crossentropy' that is specified in model.compile() is usually enough for TF to know what loss function to use. You can also use a separate loss object or function to manually compute loss and apply back-prop.
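
For concreteness, here is a minimal sketch of the compile-and-fit pattern being described (the tiny model and random data are stand-ins, not the assignment’s ResNet50):

```python
import tensorflow as tf
from tensorflow.keras import layers

# A tiny stand-in model; in the assignment this would be the ResNet50 you built
model = tf.keras.Sequential([
    layers.Input(shape=(64, 64, 3)),
    layers.Conv2D(8, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(6, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",  # tells TF which loss to minimize
              metrics=["accuracy"])             # accuracy is only a reported metric

# Random stand-in data just so the snippet runs. fit() does the forward pass,
# computes the loss, and applies back-prop and the weight updates for us.
X_train = tf.random.uniform((32, 64, 64, 3))
Y_train = tf.one_hot(tf.random.uniform((32,), maxval=6, dtype=tf.int32), depth=6)
model.fit(X_train, Y_train, epochs=2, batch_size=16)
```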

For many modern ML libraries (like TF and pytorch), the back-prop implementation is taken care of by the library. The library keeps track of all the computations you do to the input (using a computation graph), which allows them to apply back-prop for you without you having to implement it yourself.
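
And if you do want to see back-prop spelled out, here is a rough sketch of the manual route using TF’s gradient tape (names and hyperparameters are illustrative):

```python
import tensorflow as tf

loss_fn = tf.keras.losses.CategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam()

def train_step(model, x_batch, y_batch):
    with tf.GradientTape() as tape:              # records the computation graph
        y_pred = model(x_batch, training=True)   # forward pass
        loss = loss_fn(y_batch, y_pred)
    # Back-prop: the tape differentiates the loss w.r.t. every trainable weight
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```

This is essentially what model.fit() is doing for you under the hood.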

I don’t know about the Keras Tutorial Notebook either, so that sentence might just be out-of-date.

I think the high-level idea is that residual connections allow a block to act like an identity function (where the output of the block is similar to, or close in value to, the input of the block). Without residual connections, it would be difficult for a block to model an identity function.
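
One toy way to see that numerically (a sketch, not the course code): if the block’s weights are driven toward zero, the pre-activation z[l+2] is roughly zero, and ReLU of (z + input) just hands the input back, so the block collapses to the identity:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

a_l = relu(np.random.randn(4, 1))           # a previous ReLU output, so it is non-negative
W, b = np.zeros((4, 4)), np.zeros((4, 1))   # weights driven (near) zero

z = W @ a_l + b                             # ~0
print(relu(z))                              # all zeros: without the skip, the input is lost
print(relu(z + a_l))                        # equals a_l: with the skip, the block is the identity
```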


In the assignment we do make extensive use of the Conv2D Keras function. I wonder if somehow this function ‘knows’, or otherwise automatically implements, back-prop if chained along with other functions like BatchNormalization, Activation, etc.?

Yes, Conv2D from TF can automatically perform back prop if chained properly.

Yes, that’s one of the beauties of TF/Keras and other platforms like PyTorch: they completely free us from worrying about backprop. It is all handled by the platform. For “canned” functions like the activations that the platform provides, they provide the derivative functions. For everything else, they use a mathematical technique called “automatic differentiation” that is based on ideas similar to Gradient Checking, which we saw in DLS C2 W1. E.g. here’s a place to start in the TF documentation tree to understand more.

It’s all “magic” and transparent as long as we are careful and follow the rules. E.g. one very important point is that we have to be careful that all the functions in any “compute graph” that we create are TF functions. If you put a single numpy function, even something as simple as np.transpose or .T in the computation, then learning fails because the gradients are incomplete.
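
A small sketch of exactly that failure mode, comparing a TF op with its numpy counterpart inside a gradient tape (the shapes are arbitrary):

```python
import numpy as np
import tensorflow as tf

x = tf.Variable(tf.random.uniform((2, 3)))

# Correct: every op is a TF op, so the tape can back-prop through it
with tf.GradientTape() as tape:
    y = tf.reduce_sum(tf.transpose(x) ** 2)
print(tape.gradient(y, x))    # a proper gradient tensor (2 * x)

# Broken: the numpy call drops out of the computation graph
with tf.GradientTape() as tape:
    y = tf.reduce_sum(np.transpose(x.numpy()) ** 2)
print(tape.gradient(y, x))    # None -- the gradient chain is cut
```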
