Andrew skipped the details of the inception module, so here is the original paper. (Sometimes we need to check the original paper…)
The title of the paper is “Going deeper with convolutions”. As the title says, it covers several techniques for building a deeper convolutional network. One of the challenges is reducing the computational cost while keeping the representational capability of the feature maps.
Here is a figure from the paper.
The authors added 1x1 convolution layers before the 3x3 and 5x5 convolutions and after the 3x3 max pooling. All of them are there for dimension reduction, which lowers the computational cost. (Andrew skipped this part and used the picture of the naive version.)
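To make the saving concrete, here is a rough sketch that counts multiplications for the 5x5 branch with and without the 1x1 reduction. The numbers are my assumptions, taken from the paper's table for inception(3a) as I read it: a 28x28x192 input, 16 filters in the 1x1 reduce layer and 32 filters in the 5x5 layer. Stride 1 and "same" padding are assumed, and biases are ignored. `conv_mults` is just a helper name I made up.

```python
def conv_mults(h, w, k, c_in, c_out):
    """Multiplications for a k x k convolution (stride 1, 'same' padding, no bias)."""
    return h * w * k * k * c_in * c_out

# Assumed values for inception(3a); not stated in the post itself.
H, W, C_IN = 28, 28, 192   # input feature map
REDUCE, C_OUT = 16, 32     # 1x1 reduce filters, 5x5 filters

# Naive version: 5x5 convolution applied directly to all 192 input channels.
naive = conv_mults(H, W, 5, C_IN, C_OUT)

# Paper version: 1x1 convolution down to 16 channels, then the 5x5 convolution.
reduced = conv_mults(H, W, 1, C_IN, REDUCE) + conv_mults(H, W, 5, REDUCE, C_OUT)

print(f"naive:   {naive:,}")
print(f"reduced: {reduced:,}")
print(f"ratio:   {naive / reduced:.1f}x")
```

Under these assumptions, the reduced branch needs roughly ten times fewer multiplications than the naive one while producing the same 28x28x32 output.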
As a result, the inception(3a) layer produces a 28x28x256 feature map that still keeps a variety of characteristics from the different filter sizes.
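The 256 comes from concatenating the branch outputs along the channel axis. A small sketch, again assuming the per-branch filter counts from the paper's table for inception(3a) as I read it (64, 128, 32 and 32) and a 192-channel input:

```python
# Assumed output channels of each branch in inception(3a).
branches = {
    "1x1": 64,
    "3x3 (after 1x1 reduce)": 128,
    "5x5 (after 1x1 reduce)": 32,
    "max pool + 1x1 projection": 32,
}

# Every branch keeps the 28x28 spatial size, so only the channels add up.
out_channels = sum(branches.values())
print((28, 28, out_channels))  # (28, 28, 256)

# In the naive version the pooling branch has no 1x1 projection,
# so it passes all 192 input channels through unchanged.
C_IN = 192
naive_channels = 64 + 128 + 32 + C_IN
print((28, 28, naive_channels))  # (28, 28, 416)
```

So without the 1x1 projection after pooling, the output would keep growing from one module to the next, which is part of why the authors added it.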
Hope this clarifies.