How does having outputs from intermediate hidden layers serve as regularization?

In the Inception video
Time range: 5:10 - 6:00

Regarding the Inception network, I saw two main takeaways from the model structure:

  • use of Inception blocks

  • use of branches from intermediate layers, for predicting outputs

Regarding the second point:
If, as I assume, we are also doing gradient descent on the losses from these branches, in addition to the main branch, how exactly does that give a regularizing effect to the model?

I saw another question about this topic but wasn’t satisfied with the answer, as it discussed the use of 1x1 convolutions rather than this particular question.

Hi @Melange-Lf,

In the Inception network, the intermediate branches provide additional outputs with their own losses during training (these auxiliary losses are given a smaller weight than the final loss). This acts as a form of regularization by encouraging the network to learn meaningful features at various depths, which improves gradient flow to the earlier layers and reduces the risk of overfitting. In short, the auxiliary branches enforce intermediate supervision, which aids gradient propagation and helps the network generalize better.
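To make the weighting concrete, here is a minimal PyTorch sketch (not the course or GoogLeNet code; the `TinyInceptionLike` class and its layer sizes are made up for illustration) of a main classifier plus one auxiliary head, with the auxiliary loss discounted by 0.3, the weight used in the original GoogLeNet paper:

```python
# Toy model with one auxiliary classifier, showing how the side-branch
# loss is added to the main loss with a smaller weight.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyInceptionLike(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.lower = nn.Sequential(            # "lower stage"
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.upper = nn.Sequential(            # "upper stage" -> final classifier
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, num_classes),
        )
        self.aux_head = nn.Sequential(         # auxiliary classifier on the lower stage
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, num_classes),
        )

    def forward(self, x):
        mid = self.lower(x)
        return self.upper(mid), self.aux_head(mid)

model = TinyInceptionLike()
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
main_logits, aux_logits = model(x)

# The auxiliary loss is "discounted" so it guides, but does not dominate, training.
loss = F.cross_entropy(main_logits, y) + 0.3 * F.cross_entropy(aux_logits, y)
loss.backward()
```

Setting the auxiliary weight to 0 recovers the plain network, while a weight of 1 would let the side branch compete equally with the final classifier.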

Hope this helps, feel free to ask if you need further assistance!


Hello, @Melange-Lf,

I think that if those auxiliary classifiers push the lower stages to classify well, then there is not much work left for the upper stages, so adding new layers contributes little, which has a regularizing effect.

However, the auxiliary classifiers won’t classify very well on their own, because their terms in the loss function carry a discount factor. In other words, we are not pinning all our hopes on the lower stage and leaving the upper stage completely useless; rather, we are “distributing” the job of learning a bit more towards the lower stage, so that the upper stage isn’t as flexible as it would have been without the auxiliary classifiers.
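To see this “distributing” effect mechanically, here is a small continuation of the hypothetical `TinyInceptionLike` sketch above: back-propagating the auxiliary loss alone only produces gradients in the lower stage, while the final loss reaches every layer, so the discount factor decides how much of the fitting work lands on the lower stage.

```python
# Reusing the toy TinyInceptionLike model, x, y and F from the sketch above.
fresh = TinyInceptionLike()
_, aux_logits = fresh(x)
F.cross_entropy(aux_logits, y).backward()       # back-prop the auxiliary loss alone

# The lower-stage conv received a gradient from the auxiliary branch...
print(fresh.lower[0].weight.grad is not None)   # True
# ...but the upper stage did not: its gradients come only from the final loss.
print(fresh.upper[0].weight.grad is None)       # True
```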

Cheers,
Raymond
