I think the argument is that having significantly fewer parameters fundamentally means the function you end up with has far fewer degrees of freedom and is hence less complex. In other words, my first “conjecture” above, that you can get to the same result by either method given the appropriate level of training, is not at all clear from general principles.
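To put rough numbers on the "fewer parameters" point, here is a minimal sketch. It assumes the two styles being compared are a standard convolution and a depthwise separable one, which this post does not actually name. The function names and the layer sizes (3x3 kernel, 64 input channels, 128 output channels) are made up for illustration.

```python
def standard_conv_params(k, c_in, c_out, bias=True):
    """Trainable parameters in a standard k x k convolution layer."""
    # One k x k x c_in filter per output channel, plus one bias each.
    return k * k * c_in * c_out + (c_out if bias else 0)


def depthwise_separable_params(k, c_in, c_out, bias=True):
    """Trainable parameters in a depthwise separable convolution (assumed variant)."""
    # Depthwise step: one k x k filter per input channel.
    depthwise = k * k * c_in + (c_in if bias else 0)
    # Pointwise step: a 1 x 1 convolution that mixes channels.
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
k, c_in, c_out = 3, 64, 128
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
print(std, sep, round(std / sep, 1))  # 73856 8960 8.2
```

Under those assumptions the separable layer has roughly an eighth of the parameters. That shows it is a more constrained family of functions, but not, on its own, whether training can make up the difference.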
Here’s a thread which shows the comparison between the two styles of convolution more graphically, which may make things a bit more intuitive.