It appears that depth-wise separable convolution saves computational cost by avoiding redundant/repeated dot products, without any trade-off in performance. My question, then, is: why don't we use depth-wise separable convolutions in general, in any architecture that employs CNNs? Isn't it the most cost-efficient option? Also, why the specific MobileNet v1 architecture, since depth-wise separable convolution is seemingly not limited to or affected by the choice of architecture?
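To make the cost saving concrete, here is a quick back-of-the-envelope sketch counting multiplications for both kinds of convolution (the function names are my own, just for illustration; sizes match a 10x10x3 input, 3x3 kernels, and 256 output channels giving an 8x8 output):

```python
def standard_conv_mults(h_out, w_out, k, c_in, c_out):
    # Each of the c_out filters does a k*k*c_in dot product
    # at every one of the h_out*w_out output positions.
    return h_out * w_out * k * k * c_in * c_out

def separable_conv_mults(h_out, w_out, k, c_in, c_out):
    # Depth-wise step: one k*k filter per input channel.
    depthwise = h_out * w_out * k * k * c_in
    # Point-wise step: a 1x1 conv that mixes the c_in channels
    # into c_out output channels.
    pointwise = h_out * w_out * c_in * c_out
    return depthwise + pointwise

std = standard_conv_mults(8, 8, 3, 3, 256)   # 442368
sep = separable_conv_mults(8, 8, 3, 3, 256)  # 50880
print(std / sep)                              # roughly 8.7x cheaper
```

The ratio works out to roughly 1/c_out + 1/k^2, so the saving grows with the number of output channels and the kernel size.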
I may be wrong, but here is what I think. Let's say we convolve a 10x10x3 image with 3x3x3 filters to get an 8x8x256 output.
In a normal conv, we have 256 different 3x3x3 filters. In the depth-wise separable conv, we have only one 3x3x3 filter (applied channel-wise) and 256 filters of size 1x1x3.
That means the normal conv has many more parameters to train than the separable one. In many image-processing applications we do need more parameters (weights) to capture different features, even though that makes training take longer.
Yes, I think the same. Since MobileNet has fewer parameters, and the channel mixing is done with just a 1x1xn_c filter, I feel it could put a performance ceiling on the neural network.
But aren’t those the same parameters being reused repeatedly across the image?