Hi, Andrew explained in his lectures how MobileNet saves computational effort, which makes perfect sense. However, he did not have time to explain why MobileNet should have performance (accuracy) similar to a conventional ConvNet. What is the intuition here? It is not too difficult to develop something computationally less expensive, but showing that it does not reduce performance is a much harder matter. Any comments? Thank you!
I felt the same way. I don’t quite get whether there is a disadvantage to splitting the convolution into two steps, because if there weren’t, why use anything else? It would just save resources. The fact that the splitting is not featured in other networks makes me think there is a disadvantage to it that Andrew didn’t mention.
Are mentors monitoring this forum? Or is this a place for students to exchange ideas among themselves?
Great question! Sorry for the late reply. Sometimes it is worth bumping a thread if it becomes stale, because it might have flown under the radar of the mentors.
The depthwise separable convolutions reduce the number of parameters in the convolution. As such, for a small model, the model capacity may be decreased significantly if the 2D convolutions are replaced by depthwise separable convolutions. As a result, the model may become sub-optimal. However, if properly used, depthwise separable convolutions can give you efficiency without dramatically damaging your model performance.
The key takeaway, again, is:
The depthwise separable convolutions reduce the number of parameters in the convolution.
As an exercise, make up a toy example and calculate the number of params for each method. Prof Andrew Ng demonstrated the number of multiplications, but not the number of params. It is a good exercise and might help you understand even better.
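Before counting parameters, it may help to restate the multiplication count from the lecture in code. Here is a minimal sketch; the helper names and the toy sizes (a $3 \times 3$ filter, 3 input channels, 5 output channels, a $4 \times 4$ output) are my own choices for illustration, not anything from the course materials:

```python
def standard_conv_mults(f, n_in, n_out, h_out, w_out):
    # Each output value needs f * f * n_in multiplications,
    # and there are h_out * w_out * n_out output values.
    return f * f * n_in * h_out * w_out * n_out

def depthwise_separable_mults(f, n_in, n_out, h_out, w_out):
    # Depthwise step: one f x f filter per input channel.
    depthwise = f * f * h_out * w_out * n_in
    # Pointwise step: n_out filters of shape 1 x 1 x n_in.
    pointwise = n_in * h_out * w_out * n_out
    return depthwise + pointwise

std = standard_conv_mults(3, 3, 5, 4, 4)        # 27 * 80 = 2160
sep = depthwise_separable_mults(3, 3, 5, 4, 4)  # 432 + 240 = 672
print(std, sep, sep / std)                      # ratio ≈ 0.31
```

The same two functions, with `h_out * w_out` dropped and a `+ 1` per filter for the bias, give the parameter counts the exercise asks for.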
Thanks for the reply.
Yeah, true. I have a better understanding now. It makes me think that while there is a disadvantage in making the model sub-optimal, there is also an advantage: you can take networks that researchers have developed and modify them in a way that does not mess with their overall architecture but saves resources. You’re only tweaking the way the convolutions are computed.
(I am hypothesizing)
I agree. Given a fixed computation budget, with ordinary convolutions you might not be able to build a model with as much depth as you could with depthwise separable convolutions. You will have fewer parameters, but at least parameters that can learn both low-level features and, in deeper layers, high-level features. With only ordinary convolutions, you might not have the computational budget to realize the architecture you hypothesize performs best for your task on the device you are targeting. MobileNet then comes to the rescue. For state-of-the-art model performance, another model with ordinary convolutions and an unlimited computational budget may well do better.
I wanted to give the pen-and-paper exercise a go. Take a $3 \times 3$ convolution over an input with 3 channels, producing 128 output channels.

The standard convolution has $(3 \times 3 \times 3 + 1) \times 128 = 28 \times 128 = 3584$ parameters.

The depthwise convolution has $(3 \times 3 \times 1 + 1) \times 3 = 10 \times 3 = 30$ parameters.

The pointwise convolution has $(1 \times 1 \times 3 + 1) \times 128 = 4 \times 128 = 512$ parameters.

The depthwise separable convolution thus has $30 + 512 = 542$ parameters, compared to 3584 for the standard convolution.

Moreover, $542 / 3584 \approx 0.15$. Hence, the depthwise separable convolution has only 15% of the parameters of the standard convolution (for this example)!
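The arithmetic above can be double-checked with a few lines of Python. This is just a sketch of the same counting argument; the function names are mine, and the `+ 1` terms are the per-filter biases, matching the numbers in the worked example:

```python
def standard_conv_params(f, n_in, n_out):
    # Each of the n_out filters has f * f * n_in weights plus one bias.
    return (f * f * n_in + 1) * n_out

def depthwise_separable_params(f, n_in, n_out):
    # Depthwise step: one f x f filter (plus bias) per input channel.
    depthwise = (f * f + 1) * n_in
    # Pointwise step: n_out filters of shape 1 x 1 x n_in (plus bias each).
    pointwise = (1 * 1 * n_in + 1) * n_out
    return depthwise + pointwise

std = standard_conv_params(3, 3, 128)            # 28 * 128 = 3584
sep = depthwise_separable_params(3, 3, 128)      # 30 + 512 = 542
print(std, sep, round(sep / std, 2))             # ratio ≈ 0.15
```

Plugging in larger channel counts (e.g. 512 in, 512 out) shows the savings growing even more dramatic, since the depthwise cost scales with $n_{in}$ while the standard cost scales with $n_{in} \times n_{out}$.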
Thanks to Kunlun Bai for the amazing graphics: