Inception Network Clarification

Hi Sir,

In the lecture video on the Inception Network, how does adding side branches help prevent the network from overfitting? Why is this true, and could you please elaborate on it?

From the Inception paper:

Given the relatively large depth of the network, the ability to propagate gradients back through all the
layers in an effective manner was a concern. One interesting insight is that the strong performance
of relatively shallower networks on this task suggests that the features produced by the layers in the
middle of the network should be very discriminative. By adding auxiliary classifiers connected to
these intermediate layers, we would expect to encourage discrimination in the lower stages in the
classifier, increase the gradient signal that gets propagated back, and provide additional regularization.
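To make "auxiliary classifiers connected to these intermediate layers" concrete, here is a rough PyTorch sketch of such a side head. The module name AuxClassifier and the exact pooling size are my own choices, not the paper's code; the paper describes the head as 5x5 average pooling (stride 3), a 1x1 conv with 128 filters, a 1024-unit fully connected layer with 70% dropout, and a softmax classifier.

```python
# A minimal sketch (not the exact GoogLeNet head) of an auxiliary classifier
# attached to an intermediate feature map: average pooling -> 1x1 conv ->
# fully connected -> classifier.
import torch
import torch.nn as nn

class AuxClassifier(nn.Module):
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(4)            # stand-in for the paper's 5x5 avg pool, stride 3
        self.conv = nn.Conv2d(in_channels, 128, kernel_size=1)
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)
        self.fc2 = nn.Linear(1024, num_classes)
        self.dropout = nn.Dropout(0.7)

    def forward(self, x):
        x = self.conv(self.pool(x))                    # x: intermediate feature map (N, C, H, W)
        x = torch.flatten(x, 1)
        x = self.dropout(torch.relu(self.fc1(x)))
        return self.fc2(x)

# During training, the auxiliary losses are added to the main loss with a small
# weight (0.3 in the paper); at inference the auxiliary heads are discarded:
# loss = ce(main_logits, y) + 0.3 * ce(aux1_logits, y) + 0.3 * ce(aux2_logits, y)
```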

I don’t know the answer for certain, but let me attempt one, so bear with me while I think out loud:

With auxiliary classifiers, we encourage the earlier layers to make good predictions as well. In the extreme case, the first auxiliary classifier already makes perfect predictions and is all we need; the remaining layers then only have to learn something close to an identity mapping, so their weights are not really being used to fit the data. In effect, we have forced the effective part of the network to become smaller. After the first auxiliary classifier, the later layers can either keep the prediction as it is or improve on it, but the main predictive power sits in the earlier layers. In other words, auxiliary classifiers encourage the earlier layers to do the heavy lifting, and a smaller effective network is less prone to overfitting than a deeper one with more weights. Without auxiliary classifiers, all of the weights would be needed to make a prediction, giving a bigger effective network that overfits more easily.
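To illustrate the "heavy lifting" point, here is a toy sketch (hypothetical layer sizes, nothing like the real GoogLeNet) showing where the auxiliary loss sends its gradients: the layers before the branch point receive gradient from both classifiers, while the layers after it only see the main loss, so the extra training signal lands on the earlier part of the network.

```python
# Toy check of gradient flow with an auxiliary classifier branch
# (hypothetical shapes; "early" and "late" stand in for groups of layers).
import torch
import torch.nn as nn
import torch.nn.functional as F

early = nn.Linear(32, 32)      # layers before the auxiliary branch point
late = nn.Linear(32, 32)       # layers after the auxiliary branch point
main_head = nn.Linear(32, 10)
aux_head = nn.Linear(32, 10)

x = torch.randn(8, 32)
y = torch.randint(0, 10, (8,))

mid = torch.relu(early(x))
main_logits = main_head(torch.relu(late(mid)))
aux_logits = aux_head(mid)     # branches off the intermediate activation

loss = F.cross_entropy(main_logits, y) + 0.3 * F.cross_entropy(aux_logits, y)
loss.backward()

# early.weight.grad contains contributions from both losses;
# late.weight.grad only from the main loss.
print(early.weight.grad.norm(), late.weight.grad.norm())
```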