When we compare the output from a normal convolution and from a depthwise separable convolution, the outputs have the same shape, but are the values the same or not? Logically, if the filters have the same values it should come to the same result, right?

But it’s a trifle more complicated than that, right? The whole point is that the filters do not have “the same values” — they aren’t even the same shapes. That’s why the compute costs are different. It’s been a while since I originally watched these videos, and so far I have only rewatched the one where Prof Ng walks us through how the depthwise separable convolution works. I didn’t catch anywhere that he said a depthwise separable convolution is equivalent to a normal convolution just because the output shapes end up the same. This is just my guess until I have time to research further, but I would bet that with either architecture you could end up learning two mappings that produce very nearly the same result, even though the filter values themselves will be different, because the shapes are different and the whole way the computation happens is different. If that turns out to be true, then it starts to become clear why this is a good strategy: you get roughly the same result at much lower compute cost.
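To make the shape-vs-values point concrete, here is a minimal NumPy sketch (my own illustrative code, not from the course; the dimensions are arbitrary). It runs both kinds of convolution with independently random filters: the outputs have identical shapes, the values differ, and the depthwise separable version uses far fewer parameters.

```python
import numpy as np

def normal_conv(x, filters):
    """Plain convolution, 'valid' padding, stride 1.
    x: (H, W, C_in), filters: (k, k, C_in, C_out)."""
    H, W, C_in = x.shape
    k, _, _, C_out = filters.shape
    out = np.zeros((H - k + 1, W - k + 1, C_out))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + k, j:j + k, :]              # (k, k, C_in)
            out[i, j, :] = np.tensordot(patch, filters, axes=3)
    return out

def depthwise_separable_conv(x, dw_filters, pw_filters):
    """Depthwise stage then pointwise (1x1) stage.
    dw_filters: (k, k, C_in) -- one k x k filter per input channel.
    pw_filters: (C_in, C_out) -- 1x1 conv that mixes channels."""
    H, W, C_in = x.shape
    k = dw_filters.shape[0]
    dw_out = np.zeros((H - k + 1, W - k + 1, C_in))
    for i in range(dw_out.shape[0]):
        for j in range(dw_out.shape[1]):
            patch = x[i:i + k, j:j + k, :]
            # each channel is filtered separately, no channel mixing yet
            dw_out[i, j, :] = np.sum(patch * dw_filters, axis=(0, 1))
    return dw_out @ pw_filters                          # pointwise stage

rng = np.random.default_rng(0)
H, W, C_in, C_out, k = 8, 8, 3, 5, 3
x = rng.standard_normal((H, W, C_in))
f_normal = rng.standard_normal((k, k, C_in, C_out))
f_dw = rng.standard_normal((k, k, C_in))
f_pw = rng.standard_normal((C_in, C_out))

y1 = normal_conv(x, f_normal)
y2 = depthwise_separable_conv(x, f_dw, f_pw)

print(y1.shape, y2.shape)                    # same shape: (6, 6, 5) twice
print(np.allclose(y1, y2))                   # but different values: False
print(f_normal.size, f_dw.size + f_pw.size)  # 135 vs 42 parameters
```

The parameter counts follow the formulas from the lecture: a normal conv costs k·k·C_in·C_out values (3·3·3·5 = 135 here), while the separable version costs k·k·C_in + C_in·C_out (27 + 15 = 42), which is where the compute savings come from.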

But more research is required to confirm this hypothesis …