Confusion about MobileNet

In the week 2 MobileNet video, the architecture looks like this: (image from the lecture)

I don't understand why a depth-wise layer followed by a 1x1 layer would hurt performance, because I think a normal 3x3 convolution and a depth-wise convolution followed by a 1x1 convolution work the same way. I mean, a 3x3 convolution differs from a depth-wise one only in that it sums the channels together, right? And if our 1x1 convolution contains only the number 1, then we are just adding everything up, aren't we?

Hello @cpp219,
Thanks for posting. I will try to explain this in more detail.

The depth-wise convolutional layer operates differently from a normal 3x3 convolution. A normal convolution uses filters that span all input channels (3x3xC for C channels) and sums over those channels to produce each output channel. A depth-wise convolution instead applies a separate 3x3 filter to each channel and does not sum across channels. This reduces the computational cost significantly, making it more suitable for mobile or lightweight models.

On the other hand, a 1x1 convolution mixes information across channels: each output value is a weighted sum of the input channels at the same pixel. It does this without changing the spatial dimensions of the data.
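To connect this to your intuition, here is a minimal sketch in plain Python (my own illustration, not code from the course; all function names and tensor sizes are assumptions). It suggests that you are right for a single output channel: a depth-wise layer followed by a 1x1 layer of all ones gives the same result as one normal filter. More generally, the pair behaves like a normal convolution whose filters are forced to be scaled copies of the same per-channel 3x3 kernels, which is why it is cheaper but also less flexible.

```python
import random

def conv2d_single(channel, kernel):
    """Valid cross-correlation of one H x W channel with one k x k kernel."""
    k = len(kernel)
    h_out = len(channel) - k + 1
    w_out = len(channel[0]) - k + 1
    return [[sum(channel[i + a][j + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(w_out)]
            for i in range(h_out)]

def depthwise(x, kernels):
    """One k x k kernel per input channel; channels are NOT summed."""
    return [conv2d_single(ch, ker) for ch, ker in zip(x, kernels)]

def pointwise(x, weights):
    """1x1 convolution: weights[o][c] mixes input channel c into output o."""
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(wo[c] * x[c][i][j] for c in range(len(x)))
              for j in range(w)] for i in range(h)]
            for wo in weights]

def standard(x, filters):
    """Normal convolution: filters[o][c] is a k x k slice; slices are summed over c."""
    out = []
    for f in filters:
        per_ch = [conv2d_single(ch, ker) for ch, ker in zip(x, f)]
        h, w = len(per_ch[0]), len(per_ch[0][0])
        out.append([[sum(pc[i][j] for pc in per_ch) for j in range(w)]
                    for i in range(h)])
    return out

# Illustrative sizes (assumed): 6x6 input, 3 channels, 5 output channels.
random.seed(0)
C_IN, C_OUT, SIZE, K = 3, 5, 6, 3
x = [[[random.randint(-3, 3) for _ in range(SIZE)] for _ in range(SIZE)]
     for _ in range(C_IN)]
dw = [[[random.randint(-2, 2) for _ in range(K)] for _ in range(K)]
      for _ in range(C_IN)]
pw = [[random.randint(-2, 2) for _ in range(C_IN)] for _ in range(C_OUT)]

separable = pointwise(depthwise(x, dw), pw)

# The normal filters that reproduce the pair: each slice is pw[o][c] * dw[c].
equiv_filters = [[[[pw[o][c] * v for v in row] for row in dw[c]]
                  for c in range(C_IN)] for o in range(C_OUT)]
same_as_standard = standard(x, equiv_filters) == separable

# All-ones 1x1 weights: the pair collapses to a single normal filter.
ones = [[1] * C_IN]
sum_only = pointwise(depthwise(x, dw), ones) == standard(x, [dw])

print(same_as_standard, sum_only)  # prints: True True
```

With these sizes a normal layer has 5 x 3 x 9 = 135 free weights, while the depth-wise plus 1x1 pair has only 3 x 9 + 5 x 3 = 42, so the pair can represent only a subset of what the normal layer can.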

So now let's combine the layers to see the benefits:
1- Rich Spatial Information: The 3x3 depth-wise convolution captures spatial patterns within each channel, which lets the model learn spatial features from the input image. I guess you already know this part.
2- Reduced Computational Cost: By using a depth-wise 3x3 convolution in place of a normal 3x3 convolution, we significantly reduce the computational burden, since each filter operates on a single channel instead of on all of them.
3- Channel-wise Transformation: The subsequent 1x1 convolution mixes the channels that the depth-wise step kept separate, which can be beneficial for learning more abstract representations.
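To put rough numbers on the cost saving, here is a small sketch that counts multiplications for both options. The helper names and the layer sizes (3x3 filters, 3 input channels, 5 output channels, 4x4 output) are assumptions chosen only for illustration.

```python
def normal_conv_mults(h_out, w_out, f, c_in, c_out):
    """Multiplications for a normal f x f convolution."""
    return f * f * c_in * h_out * w_out * c_out

def separable_conv_mults(h_out, w_out, f, c_in, c_out):
    """Multiplications for a depth-wise f x f step plus a 1x1 step."""
    depthwise = f * f * h_out * w_out * c_in      # one filter per channel
    pointwise = c_in * h_out * w_out * c_out      # 1x1 mixing of channels
    return depthwise + pointwise

normal = normal_conv_mults(4, 4, 3, 3, 5)         # 2160
separable = separable_conv_mults(4, 4, 3, 3, 5)   # 432 + 240 = 672
ratio = separable / normal                        # about 0.31

print(normal, separable, round(ratio, 2))  # prints: 2160 672 0.31
```

The ratio works out to 1/c_out + 1/f^2, so with many output channels and 3x3 filters the pair needs roughly one ninth of the multiplications of a normal convolution.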

I hope this makes sense now. Feel free to ask if you need more clarification.
Best Regards,
