We observe how splitting a 5x5 convolution into two steps, by adding a 1x1 convolution as a pre-step, reduces computation by a factor of roughly 10 (from about 120 million multiplications to about 12 million).
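Here is a minimal sanity check of those numbers, assuming the setup from the video: a 28x28x192 input, "same" padding, 32 output filters, and a 16-filter 1x1 bottleneck. The helper `conv_multiplies` is just a name I made up for this sketch.

```python
def conv_multiplies(out_h, out_w, out_c, k, in_c):
    """Multiplications in a conv layer: one k*k*in_c dot product per output value."""
    return out_h * out_w * out_c * (k * k * in_c)

# Direct 5x5 convolution: 192 -> 32 channels
direct = conv_multiplies(28, 28, 32, 5, 192)          # 120,422,400

# Bottleneck: 1x1 conv (192 -> 16), then 5x5 conv (16 -> 32)
bottleneck = (conv_multiplies(28, 28, 16, 1, 192)     # 2,408,448
              + conv_multiplies(28, 28, 32, 5, 16))   # 10,035,200

print(f"direct: {direct:,}, bottleneck: {bottleneck:,}")
# direct: 120,422,400, bottleneck: 12,443,648 -> roughly a 10x reduction
```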

While I understand how this works, by implementing it we are also reducing the number of learnable parameters by a factor of about 10.

With just the 5x5 CONV layer (32 filters) on an input of 28x28x192, as in the video, we get 5*5*192*32 = 153,600 learnable parameters.

By adding the 1x1 CONV layer (16 filters) on the same input, followed by the 5x5 CONV layer with 32 filters, we get 1*1*192*16 + 5*5*16*32 = 15,872 learnable parameters.
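The same comparison in code, counting weights only (add one bias per filter if you count biases too); `conv_params` is again a hypothetical helper for this sketch:

```python
def conv_params(k, in_c, out_c):
    """Weights in a conv layer: one k x k x in_c kernel per output filter."""
    return k * k * in_c * out_c

direct = conv_params(5, 192, 32)                              # 153,600
bottleneck = conv_params(1, 192, 16) + conv_params(5, 16, 32) # 3,072 + 12,800
print(direct, bottleneck)                                     # 153600 15872
```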

Why does this loss of parameters, in exchange for the decrease in computation, work out to our advantage?

(Trying out some humour in the explanation)

Hey, so I made our model 10x faster!

Really? How?

It learns 10 times fewer parameters.