1*1 Convolution doubt

Professor Ng mentioned how we can use 1*1 convolution filters to reduce the number of channels while keeping the width and height preserved. But my point is, we can use “same” padding with 3*3 (or any size) filters to achieve this too. I just need to reduce the number of filters, irrespective of whether they are 1*1 or 3*3. I guess this is not unique to 1*1 filters, right?

Yes, there is more than one way to implement a layer that preserves the h and w dimensions and reduces the number of output channels. But the point is that 1 x 1 convolutions are very specific for that purpose: there is no “bleeding together” of the pixels as there is in “same” convolutions with a filter size > 1. It turns out that both styles are useful in the right situation, as we’ll see in the next few lectures.
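To make the “bleeding together” point concrete, here is a minimal pure-Python sketch. The helper `conv_same` and the toy values are made up for illustration and are not from the course. Both filters keep h and w and reduce 2 channels to 1, but only the 3 x 3 output at the centre pixel reacts when a neighbouring pixel changes.

```python
def conv_same(image, kernel):
    """Stride-1 'same' convolution for ONE output filter (hypothetical helper).

    image:  h x w x c_in nested lists
    kernel: f x f x c_in nested lists, f odd
    returns an h x w list of floats (a single output channel)
    """
    h, w, c_in = len(image), len(image[0]), len(image[0][0])
    f = len(kernel)
    pad = f // 2  # zero padding that preserves h and w
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in range(f):
                for dj in range(f):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:  # outside the image counts as zero
                        for c in range(c_in):
                            total += image[y][x][c] * kernel[di][dj][c]
            out[i][j] = total
    return out


# Toy 3 x 3 image with 2 channels (illustrative values).
img_a = [[[1.0, 2.0] for _ in range(3)] for _ in range(3)]
img_b = [[pixel[:] for pixel in row] for row in img_a]
img_b[1][2] = [5.0, 5.0]  # change only the right-hand neighbour of the centre pixel

k1 = [[[0.5, -1.0]]]                                     # 1 x 1 filter over 2 channels
k3 = [[[0.1, 0.2] for _ in range(3)] for _ in range(3)]  # 3 x 3 filter over 2 channels

out1_a, out1_b = conv_same(img_a, k1), conv_same(img_b, k1)
out3_a, out3_b = conv_same(img_a, k3), conv_same(img_b, k3)

# Both outputs are 3 x 3 with a single channel.
print(len(out1_a), len(out1_a[0]), len(out3_a), len(out3_a[0]))
# The 1 x 1 output at the centre ignores the neighbour change.
print(out1_a[1][1] == out1_b[1][1])
# The 3 x 3 output at the centre does not.
print(out3_a[1][1] == out3_b[1][1])
```

So the output shapes match, but the values do not mean the same thing: the 1 x 1 result at a pixel depends only on that pixel's channels.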

Thanks Paul, I have another doubt. In MobileNet, we saw depthwise separable convolution. Instead of doing the depthwise convolution first and then the pointwise convolution, we could have used the pointwise convolution alone to achieve the results. Why did we have to do the depthwise step first? Is it because pointwise cannot capture edges effectively?

Here again, the point is that different styles of convolutions have different effects. If all you’re talking about is the dimension of the output, then sure there are lots of ways to get that. But the point is not all of them have the same effect in terms of the values of the various output elements. That matters at least as much as the shape, right? I disagree that you could achieve the same effect doing only a pointwise convolution as doing a depthwise convolution followed by a pointwise convolution. Of course it depends on the parameters of your depthwise convolution, but what you said is clearly not a true statement in the general case. The case you mention of edge detection is an example of something that a depthwise convolution could do, but a pointwise convolution can’t.
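The edge detection case can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: a single channel, a hand-picked vertical-edge kernel rather than learned weights, and hypothetical helper names `depthwise_valid` and `pointwise`.

```python
def depthwise_valid(channel, kernel):
    """Filter ONE channel with its own f x f kernel (no padding, stride 1)."""
    h, w, f = len(channel), len(channel[0]), len(kernel)
    return [[sum(channel[i + di][j + dj] * kernel[di][dj]
                 for di in range(f) for dj in range(f))
             for j in range(w - f + 1)]
            for i in range(h - f + 1)]


def pointwise(channel, weight):
    """1 x 1 convolution on a single channel: each pixel is only rescaled."""
    return [[weight * v for v in row] for row in channel]


# 4 x 6 single-channel image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Hand-picked vertical-edge kernel (assumed for illustration).
edge_kernel = [[-1, 0, 1],
               [-1, 0, 1],
               [-1, 0, 1]]

edges = depthwise_valid(image, edge_kernel)
scaled = pointwise(image, 2.0)

print(edges[0])   # [0, 3, 3, 0]: nonzero only where the window straddles the edge
print(scaled[0])  # [0.0, 0.0, 0.0, 2.0, 2.0, 2.0]: every bright pixel looks the same
```

Whatever weight the pointwise step uses, its output at a pixel cannot tell whether that pixel sits on an edge or in the middle of a flat bright region, because it never looks at the neighbours.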

Exactly, I got that point. To get the right dimensions, we can use many different filter sizes. But the thing is, the output values of an f=3 convolution filter will be more detailed than those of a pointwise convolution, right? Pointwise is always cheaper in terms of computation, but its output values aren’t interesting enough to capture features, and that’s why we use depthwise or f=3 filters, which can capture more features. Is that right, Sir?

Yes, a convolution with filter size > 1 can extract different information from the inputs than a pointwise convolution. Maybe the way to think of pointwise is that it is a relatively efficient way to compress the information or “downsample” it, although it’s also transforming the information in addition to simply reducing the size of it. And the coefficients are learned so it’s different than a pooling layer. Because of the fact that these layers are also trained, the theory must be that they are learning to extract the important information from the larger number of inputs.
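For a rough sense of “relatively efficient”, the multiplications can be counted with the standard formulas. The shapes below are illustrative assumptions (3 x 3 filters, 3 input channels, 5 output channels, a 4 x 4 output), and the function names are made up for this sketch.

```python
def standard_conv_mults(f, c_in, c_out, h_out, w_out):
    """Multiplications for an ordinary f x f convolution."""
    return f * f * c_in * h_out * w_out * c_out


def depthwise_mults(f, c_in, h_out, w_out):
    """Multiplications for the depthwise step: one f x f filter per input channel."""
    return f * f * h_out * w_out * c_in


def pointwise_mults(c_in, c_out, h_out, w_out):
    """Multiplications for the 1 x 1 step that mixes channels."""
    return c_in * h_out * w_out * c_out


# Illustrative shapes (assumed, not tied to any particular network).
f, c_in, c_out, h_out, w_out = 3, 3, 5, 4, 4

standard = standard_conv_mults(f, c_in, c_out, h_out, w_out)
depthwise = depthwise_mults(f, c_in, h_out, w_out)
pointwise_only = pointwise_mults(c_in, c_out, h_out, w_out)
separable = depthwise + pointwise_only

print(standard)        # 2160
print(separable)       # 672 (432 depthwise + 240 pointwise)
print(pointwise_only)  # 240
```

The pointwise step alone is the cheapest, but it carries no spatial information. The depthwise step adds that back for far fewer multiplications than a full 3 x 3 convolution would need.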

There are a number of interesting things to understand in the MobileNet architecture. The bottleneck structure expands and then contracts in each bottleneck section. Here’s another thread with some discussion about that.