Yes, a convolution with filter size > 1 can extract different information from the inputs than a pointwise (1×1) convolution. One way to think of a pointwise convolution is as a relatively efficient way to compress, or "downsample," the channel dimension, although it also transforms the information rather than simply reducing its size. And because the coefficients are learned, it's different from a pooling layer. Since these layers are trained along with the rest of the network, the theory is that they learn to extract the important information from the larger number of input channels.
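To make the "learned compression" idea concrete, here's a minimal sketch (in plain numpy, not any particular framework's API): a pointwise convolution is just one weight matrix applied independently at every spatial position, so it mixes channels but sees no spatial neighborhood.

```python
import numpy as np

def pointwise_conv(x, weights):
    # x: (H, W, C_in) feature map; weights: (C_in, C_out)
    # The same channel-mixing matrix is applied at each pixel.
    return np.einsum('hwc,cd->hwd', x, weights)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 64))   # 8x8 feature map with 64 channels
w = rng.normal(size=(64, 16))     # learned weights compressing 64 -> 16 channels
y = pointwise_conv(x, w)
print(y.shape)                    # (8, 8, 16)
```

At each pixel the output is simply `x[i, j] @ w`, which is why a 1×1 convolution is cheap: its cost scales with C_in × C_out, with no spatial kernel at all.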
There are a number of interesting things to understand in the MobileNet architecture. The bottleneck structure expands and then contracts the channel count in each bottleneck block. Here's another thread with some discussion about that.
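The expand-then-contract pattern can be sketched as follows. This is a simplified stand-in for a MobileNetV2-style inverted bottleneck, not the real implementation: the depthwise step here uses a fixed 3×3 box filter instead of learned per-channel kernels, and batch norm is omitted, but the channel shapes (expand by a factor, mix spatially, project back down) are the point.

```python
import numpy as np

def depthwise_3x3(x):
    # Spatial mixing only, no channel mixing: each channel is filtered
    # independently. A 3x3 box mean stands in for learned depthwise kernels.
    H, W, C = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = xp[i:i+3, j:j+3].mean(axis=(0, 1))
    return out

def inverted_bottleneck(x, w_expand, w_project):
    # 1x1 expand + ReLU, depthwise 3x3, then linear 1x1 projection back down
    h = np.maximum(np.einsum('hwc,cd->hwd', x, w_expand), 0)
    h = depthwise_3x3(h)
    return np.einsum('hwc,cd->hwd', h, w_project)

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 6, 24))            # 24 input channels
w_up = rng.normal(size=(24, 144)) * 0.1    # expand by factor 6: 24 -> 144
w_down = rng.normal(size=(144, 24)) * 0.1  # project back: 144 -> 24
y = inverted_bottleneck(x, w_up, w_down)
print(y.shape)                             # (6, 6, 24)
```

The expensive spatial convolution runs in the wide (expanded) representation, but it's depthwise, so its cost grows only linearly in the channel count; all the channel mixing happens in the two cheap 1×1 layers.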