I think I answered my own question. In the original MobileNet paper (Howard et al., 2017), the first layer is a traditional convolution with $32 \times (3,3,3)$ kernels. Each of those kernels spans all three channels of the RGB image and is convolved independently of the others, producing 32 output feature maps, so the network can learn 32 distinct spatial features.
The first depthwise convolution then uses $32 \times (3,3)$ filters, one per channel, so the channels are filtered independently and never mixed. Since the feature space of the network is first expanded to 32 channels by a traditional convolution, it makes me think that the purpose of the $(1,1)$ projection convolution is to create many different "strengths" of each filter, i.e. learned weighted combinations of the depthwise outputs, much like the $1 \times 1$ convolutions in an Inception network.
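To make the shapes concrete, here is a minimal PyTorch sketch of those first three layers as I understand them from the paper's architecture table (batch norm and ReLU omitted for brevity; this is my own illustration, not the reference implementation):

```python
import torch
import torch.nn as nn

# Standard convolution: 32 kernels of shape (3, 3, 3), one output map each.
standard = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3,
                     stride=2, padding=1, bias=False)

# Depthwise convolution: one (3, 3) filter per input channel (groups=32),
# so each channel is filtered independently and channels are never mixed.
depthwise = nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3,
                      stride=1, padding=1, groups=32, bias=False)

# Pointwise (1, 1) projection: each of the 64 output channels is a learned
# linear combination (a "strength") of the 32 depthwise outputs.
pointwise = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=1,
                      bias=False)

x = torch.randn(1, 3, 224, 224)        # dummy RGB image
y = pointwise(depthwise(standard(x)))  # -> torch.Size([1, 64, 112, 112])
print(y.shape)
```

The key detail is `groups=32` in the depthwise layer: it is what prevents any cross-channel mixing there, leaving all of that work to the $(1,1)$ projection.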