We observe how splitting a 5x5 convolution into two steps, by adding a 1x1 convolution as a pre-step, reduces computation by roughly a factor of 10 (from about 120 million multiplications down to about 12 million).
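In case it helps, here is a quick sketch (my own arithmetic, not from the video) of where those two multiply counts come from, assuming "same" padding so the output stays 28x28 and counting only multiplications:

```python
# Dimensions from the video: 28x28x192 input, 5x5 conv to 32 channels,
# with an optional 1x1 bottleneck down to 16 channels first.
h, w, c_in, c_out, k, c_mid = 28, 28, 192, 32, 5, 16

# Direct 5x5 conv: one multiply per output value per weight in the filter.
direct_ops = h * w * c_out * (k * k * c_in)            # 120,422,400  (~120M)

# Bottleneck version: 1x1 conv to 16 channels, then 5x5 conv to 32 channels.
reduced_ops = h * w * c_mid * c_in \
            + h * w * c_out * (k * k * c_mid)           # 12,443,648   (~12M)

print(direct_ops, reduced_ops, direct_ops / reduced_ops)  # roughly a 10x reduction
```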
While I understand how this works, by implementing it we are also reducing the number of learnable parameters by roughly a factor of 10.
With just the 5x5 CONV layer, 32 filters on an input of 28x28x192 as in the video, we get 5*5*192*32 = 153,600 learnable parameters.
By adding the 1x1 CONV layer (16 filters) on the same input, followed by the 5x5 CONV layer with 32 filters, we get 1*1*192*16 + 5*5*16*32 = 3,072 + 12,800 = 15,872 learnable parameters.
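Here is a small PyTorch sketch I used to double-check those parameter counts (the layer shapes follow the video; ignoring bias terms for simplicity):

```python
import torch.nn as nn

# Direct 5x5 convolution: 192 -> 32 channels, "same" padding.
direct = nn.Conv2d(192, 32, kernel_size=5, padding=2, bias=False)

# Bottleneck version: 1x1 conv down to 16 channels, then 5x5 conv to 32 channels.
bottleneck = nn.Sequential(
    nn.Conv2d(192, 16, kernel_size=1, bias=False),
    nn.Conv2d(16, 32, kernel_size=5, padding=2, bias=False),
)

def count_params(module):
    return sum(p.numel() for p in module.parameters())

print(count_params(direct))      # 153,600  (5*5*192*32)
print(count_params(bottleneck))  # 15,872   (1*1*192*16 + 5*5*16*32)
```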
Why does this trade-off, giving up learnable parameters in exchange for less computation, work out to our advantage?
(Trying out some humour in the explanation)
Hey, so I made our model 10x faster!
Really? How?
It learns 10 times fewer parameters.