If the shortcuts in Residual Networks help the network quickly learn identity functions, couldn’t the portions of the network where the identity has been learned simply be purged after training, leaving a simplified, smaller network?

And if we then re-train this simplified network, shouldn’t the results be as good as those of the original un-purged network? So is this really kind of like tuning the hyperparameters of the network’s layers/neurons, except we do it through the training process itself?

Shouldn’t this be possible even in regular (non-residual) neural networks? I.e., examine the parameters (weight matrices) to find the ones that are close to the identity, and eliminate them to simplify?
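One way to make the “close to identity” test concrete is to measure how far a square weight matrix is from the identity matrix. This is just a toy sketch (it ignores biases, nonlinearities, and non-square layers), not an established pruning criterion:

```python
import numpy as np

def distance_from_identity(W):
    """Relative Frobenius distance between a square weight matrix W and the
    identity. A small value suggests the layer approximately computes f(x) = x."""
    n = W.shape[0]
    I = np.eye(n)
    return np.linalg.norm(W - I) / np.linalg.norm(I)

# A layer whose weights drifted only slightly from the identity
W_near = np.eye(4) + 0.01 * np.random.default_rng(0).standard_normal((4, 4))
# A layer with fully arbitrary weights
W_far = np.random.default_rng(1).standard_normal((4, 4))

print(distance_from_identity(W_near))  # small
print(distance_from_identity(W_far))   # much larger
```

In practice a layer can be far from the identity as a matrix yet still contribute little to the final output (or vice versa), which is part of why purging on this basis alone is risky.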

I do realize the intuition of ResNets is also to help with backpropagation’s vanishing gradient problem.

I don’t know if it’s safe to purge a layer without careful consideration.

The MLOps specialization talks about model optimization steps for a leaner production footprint. Please check it out.


In addition to Balaji’s point, I think you should listen again to what Prof Ng says about all this. The goal is not to *learn* the identity function: that is just the starting point that you get from the “skip” layers. The real point is that having that alternate path provides a “smoothing” effect on the training and gives you a better chance of avoiding vanishing or exploding gradients. In other words, the skip layers give you the ability to successfully train a deeper and more complex network. But once you have the trained network, removing layers doesn’t really make sense: you’re modifying the network, so how do you know the trained parameters will still work in that different network? They were trained on a different network, right? My intuition would be that it fundamentally doesn’t make sense, but this is an experimental science: you can try what you suggest and see what happens. If you learn anything one way or the other, let us know. Science!
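The point about parameters being trained for a different network can be seen in a toy sketch. Here a single-layer residual block with near-zero weights behaves like the identity, but the *same* weights with the skip path removed compute a very different function (this is illustrative numpy code, not an actual ResNet implementation):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W):
    # With the skip connection: output = ReLU(W @ x) + x
    return relu(W @ x) + x

def plain_block(x, W):
    # Same parameters, but the skip path removed
    return relu(W @ x)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4)) * 0.01  # residual branch trained to be near zero
x = rng.standard_normal(4)

print(residual_block(x, W))  # close to x: the block approximates the identity
print(plain_block(x, W))     # close to 0: same weights, very different function
```

So “purging” a residual block whose branch has learned to output ~0 means removing the whole block (keeping only the skip path), not just dropping the skip connection — and even then, everything downstream was trained with that block present.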

Disclaimer: I am just a fellow student, not a domain expert. All I know is what I’ve heard Prof Ng say in these lectures. Now that I think about it, I do remember someone mentioning that there is some work on “pruning” networks, although I don’t remember whether Prof Ng ever discusses it in these courses. If you google that term, here’s one paper that turns up. Have a look and see if it discusses ideas similar to what you are suggesting above.
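For what it’s worth, the pruning work I’ve seen usually removes individual small-magnitude weights (followed by fine-tuning) rather than whole near-identity layers. A minimal unstructured magnitude-pruning sketch, not taken from any particular paper:

```python
import numpy as np

def magnitude_prune(W, fraction):
    """Zero out roughly the smallest-magnitude `fraction` of entries in W
    (unstructured pruning). Real pipelines fine-tune afterwards to recover
    accuracy; ties at the threshold may prune slightly more than `fraction`."""
    k = int(W.size * fraction)
    if k == 0:
        return W.copy()
    threshold = np.sort(np.abs(W), axis=None)[k - 1]
    pruned = W.copy()
    pruned[np.abs(W) <= threshold] = 0.0
    return pruned

W = np.array([[0.9, -0.05],
              [0.02, -1.2]])
print(magnitude_prune(W, 0.5))
# the two small entries (0.02 and -0.05) are zeroed; the large ones survive
```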
