ResNet is definitely an interesting concept, and I was reading up more about it when I came across this post. That information loss grows towards the deeper layers is fairly intuitive to understand. So the deeper layers are contributing a “residual” (delta) value on top of the more “robust” values from the earlier layers.
But then the question is: does it even make sense to build very deep networks when the deeper layers contribute only residual amounts? In general, are there ways to figure out the optimal depth? I am on the Week 2 ResNet videos; perhaps this is covered in one of the later videos?
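To check my own understanding, here is a toy sketch (entirely my own, not code from the course) of what I think a residual block computes. The shapes, weights, and the `residual_block` helper are made up for illustration; the point is just that the block adds a learned delta back onto the skip-connected input, so if the delta is near zero the block behaves like the identity.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(a_l, W1, b1, W2, b2):
    """Toy two-layer residual block (my own sketch, not the course's code).
    The block learns a delta F(a_l) that is added onto the skip input a_l."""
    a1 = relu(W1 @ a_l + b1)
    z2 = W2 @ a1 + b2
    # Skip connection: output = g(F(a_l) + a_l).
    # If W2 and b2 are near zero, F(a_l) ~ 0 and the block is ~identity,
    # i.e. the deeper layers only add a small "residual" on top of the
    # earlier activations.
    return relu(z2 + a_l)

# Purely illustrative usage with made-up shapes
rng = np.random.default_rng(0)
a_l = rng.standard_normal(4)
W1, b1 = rng.standard_normal((4, 4)) * 0.01, np.zeros(4)
W2, b2 = rng.standard_normal((4, 4)) * 0.01, np.zeros(4)
print(residual_block(a_l, W1, b1, W2, b2))
```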
These are not easy questions with crisp answers, which is probably why no one answered back when you originally asked them.
I don’t have definitive answers, but here are some thoughts:
I am not an expert in the field; all I know is what I have heard Prof Ng say in the lectures across the various courses here. He does not give a general method for determining the number of layers, and if there were such a method he would probably have mentioned it. Generally speaking, he says to start with an architecture that worked well on a problem as similar as you can find to the one you are trying to solve. Then you use the evaluation methods he describes in Course 2 and Course 3 to decide how to improve performance, which might mean changing the number, sizes, and types of the various layers.
But those residual amounts are not zero, right? The proof is in the pudding: do the deeper networks work better or don’t they? If they didn’t work better in at least some cases, people wouldn’t use them. The point of Residual Nets is that the skip connections have a moderating effect on training and allow you to successfully train deeper networks than you otherwise could. That’s what Prof Ng says in the lectures, as I recall.
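If it helps to see that idea in code, here is a rough Keras-style sketch of an identity block (my own simplification, not the exact code from the Week 2 assignment). The filter count and input shape are arbitrary; the key line is the `Add` that carries the shortcut around the conv layers, so each block can fall back towards the identity if its extra layers don’t help, which is what makes very deep stacks trainable.

```python
import tensorflow as tf
from tensorflow.keras import layers

def identity_block(x, filters):
    """Sketch of a ResNet-style identity block: two conv layers plus a
    shortcut that adds the block's input back onto its output."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])  # the skip connection
    return layers.Activation("relu")(y)

# Minimal usage: stack a few blocks on a toy input (shapes are made up)
inputs = tf.keras.Input(shape=(32, 32, 16))
x = inputs
for _ in range(3):
    x = identity_block(x, 16)
model = tf.keras.Model(inputs, x)
model.summary()
```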