Why does the Inception network use two CONVs?

The Inception Network Motivation video (7:15) uses two CONVs:

1st CONV (1x1 with 16 filters) to get 28x28x16
2nd CONV (5x5 with 32 filters) to get 28x28x32

I’m wondering why they didn’t just use one CONV (1x1 with 32 filters) to get 28x28x32, without the 2nd CONV?
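For concreteness, here is a minimal Keras sketch of that two-CONV stack. The 28x28x192 input volume is the one used in the lecture example; the layer arrangement is otherwise illustrative:

```python
# Minimal sketch of the two CONVs from the lecture, assuming the
# 28x28x192 input volume used in that example.
import tensorflow as tf

inputs = tf.keras.Input(shape=(28, 28, 192))
x = tf.keras.layers.Conv2D(16, kernel_size=1, activation="relu")(inputs)  # 1x1 -> 28x28x16
outputs = tf.keras.layers.Conv2D(
    32, kernel_size=5, padding="same", activation="relu"
)(x)  # 5x5 with "same" padding -> 28x28x32
tf.keras.Model(inputs, outputs).summary()  # prints the 28x28x16 and 28x28x32 shapes
```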

But that would not do the same kind of processing on the input. A 5 x 5 convolution does something completely different from a 1 x 1 convolution: with a 1 x 1, there is no interaction between adjacent pixels, so no ability to detect local geometry. Just because two functions give the same output size does not mean the functions are equivalent.
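You can check this claim directly. In this toy sketch (random weights, arbitrary shapes), perturbing a neighbour of the centre pixel leaves the 1 x 1 output at the centre unchanged, while the 5 x 5 output reacts:

```python
# Toy check: a 1x1 conv's output at a pixel depends only on that pixel's
# channels, so changing a neighbour has no effect; a 5x5 conv does react.
import numpy as np
import tensorflow as tf

x = np.random.rand(1, 7, 7, 3).astype("float32")
x_perturbed = x.copy()
x_perturbed[0, 3, 4, :] += 1.0  # perturb a neighbour of the centre pixel (3, 3)

conv1x1 = tf.keras.layers.Conv2D(4, kernel_size=1, padding="same")
conv5x5 = tf.keras.layers.Conv2D(4, kernel_size=5, padding="same")

# Compare the output at the centre pixel before/after the perturbation:
print(np.allclose(conv1x1(x)[0, 3, 3], conv1x1(x_perturbed)[0, 3, 3]))  # True: 1x1 ignores neighbours
print(np.allclose(conv5x5(x)[0, 3, 3], conv5x5(x_perturbed)[0, 3, 3]))  # False: 5x5 sees the neighbourhood
```

As a design note, the lecture's reason for putting the 1 x 1 first is the bottleneck: shrinking 192 channels to 16 before the 5 x 5 cuts the multiplication count by roughly a factor of ten.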
