When explaining the convolution of an input (or previous layer's output) of size 4 x 4 x 3 with a 1 x 1 x 3 filter, it is completely understandable how he gets one 4 x 4 x 1 matrix. However, he then talks about nc’ = 5 and gets a 4 x 4 x 5 output.

How?

Are the other four matrices behind the first one the same as the first one? How are the nc’ filters used in the computation? @paulinpaloalto @andrewng

Prof Ng explains that in the lecture. Listen again at 9:35. Unlike the “depthwise” step, the pointwise convolutions work the same as “normal” convolutions: the number of filters determines the number of output channels. Each filter must match the number of input channels. Each filter is learned differently (has a different purpose), and the total number of them you define (choose) determines the number of output channels. The choice of the number of filters is what Prof Ng calls a “hyperparameter”, meaning simply a choice you need to make. In this case he has chosen to have 5 filters, so the total dimensions of W for the second “pointwise” step are 1 x 1 x 3 x 5, right?

I understand that part about choosing the hyperparameter of 5 filters; the question is more about how the other four are being computed. I understand the first “layer”, but how are the others computed?

I don’t understand your point. This is exactly like a “normal” convolution: it just happens that the filter size is f = 1, right? So you have five different filters each shaped 1 x 1 x 3 and you apply each one individually just as you normally apply a “conv” filter. Each one gives you a 4 x 4 x 1 output and there are 5 of them, so you end up with an output that is 4 x 4 x 5. What is mysterious or confusing about that?
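To make that concrete, here is a minimal NumPy sketch of the pointwise step with the shapes from the lecture example (4 x 4 x 3 input, nc’ = 5 filters of shape 1 x 1 x 3). The random values are just placeholders standing in for learned parameters:

```python
import numpy as np

# Shapes assumed from the lecture example: 4 x 4 x 3 input, nc' = 5 filters.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 3))     # input: height x width x channels
W = rng.standard_normal((1, 1, 3, 5))  # filters: 1 x 1 x n_c x n_c'

out = np.zeros((4, 4, 5))
for k in range(5):                     # apply each of the 5 filters separately
    for i in range(4):
        for j in range(4):
            # filter k dots its 3 values with the 3 input channels at (i, j)
            out[i, j, k] = np.sum(x[i, j, :] * W[0, 0, :, k])

print(out.shape)                       # (4, 4, 5)
```

Each pass of the outer loop is one full “normal” 1 x 1 convolution producing a 4 x 4 x 1 slice; stacking the 5 slices gives the 4 x 4 x 5 output.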

And note that we don’t “choose” the filter values, just the number of filters. The filter values are “parameters”, meaning that they are learned through back propagation, just like normal. Absolutely plain vanilla compared to everything else we’ve learned up to this point, right?

The depthwise filters are completely separate: that is the previous step. It is completely independent. Those parameters are also learned through back propagation. But of course back propagation propagates through all the layers using the Chain Rule just like normal, so what happens in the later layers affects the earlier layers.
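For contrast with the pointwise step, here is a sketch of the depthwise step (shapes chosen for illustration, not taken from the lecture): each input channel gets its own f x f filter, and channels are never mixed; the mixing is left to the later pointwise step.

```python
import numpy as np

# Illustrative shapes: 6 x 6 x 3 input, one 3 x 3 filter per channel.
rng = np.random.default_rng(1)
f, n_c = 3, 3
x = rng.standard_normal((6, 6, n_c))   # input: height x width x channels
W = rng.standard_normal((f, f, n_c))   # one f x f filter per input channel

h_out = x.shape[0] - f + 1             # "valid" convolution: 6 - 3 + 1 = 4
out = np.zeros((h_out, h_out, n_c))
for c in range(n_c):                   # channel c uses only filter c
    for i in range(h_out):
        for j in range(h_out):
            out[i, j, c] = np.sum(x[i:i + f, j:j + f, c] * W[:, :, c])

print(out.shape)                       # (4, 4, 3)
```

Note the output has the same number of channels as the input; that is why the pointwise step is needed afterwards to change the channel count.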

Yes, I guess the picture could have been more complete. But he literally said all the necessary words in the lecture to explain what he means here. It’s the area around 9:35 into that lecture.

Hi paulinpaloalto, I had a similar question to DrRobot’s. Your answer clarified it a bit, but I want to confirm one more thing. In the 1 x 1 x 3 pointwise filter above, the filter values are (2, 2, 2). Should these values be the same in all 3 positions, since it is actually just 1 filter?

Yes, that is just an example and maybe not a very good one. Typically you would not expect the filter values to all be the same in all positions or even be integers for that matter. There is no reason why that would actually happen in “real life” as all those values are learned through back propagation and start from random initializations for symmetry breaking.

You can see that the values produced match, though.

I just realized you are saying these convolutional filters (not max pooling) are also trainable variables via back-prop. But edge detectors are not trainable filters? So are the filter values themselves the weights that get trained through back-prop?

The “edge filters” that Prof Ng shows in Week 1 of Convnets are just a demonstration of how conv layers can detect things. That is “old school” and nobody uses hand-coded filters like that these days: you just randomly initialize the filters and let back prop learn what it needs to in order to solve the problem at hand.

What is the difference between Pointwise Convolution and 1x1 Convolutions that we learnt earlier? Seems like they are no different. Is there a reason they are named differently? If so what is the difference?

Yes, I believe those are just two names for the same operation. It’s not unusual for people to have multiple ways to name or describe the same operation or phenomenon.
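One way to see why the two names describe one operation: a pointwise / 1 x 1 convolution is just a per-pixel matrix multiply over the channel dimension. A small sketch (shapes chosen for illustration):

```python
import numpy as np

# Illustrative shapes: 4 x 4 x 3 input, weight matrix mapping 3 -> 5 channels.
rng = np.random.default_rng(2)
x = rng.standard_normal((4, 4, 3))
W = rng.standard_normal((3, 5))            # n_c x n_c' weights

# "Convolution" view: slide a 1 x 1 filter bank over every pixel ...
conv = np.zeros((4, 4, 5))
for i in range(4):
    for j in range(4):
        conv[i, j, :] = x[i, j, :] @ W

# ... which is identical to one reshaped matrix multiply over channels.
matmul = (x.reshape(-1, 3) @ W).reshape(4, 4, 5)
print(np.allclose(conv, matmul))           # True
```

So “1x1 convolution” describes the sliding-filter view and “pointwise convolution” describes the per-pixel view, but the arithmetic is the same.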