I am watching the video "Why ResNets Work?" and I am not following it. My most critical question is:
Q1. Let's assume the identity function has been learned, so that a^[l+2] = a^[l]. Then what? It feels like we are computing f(x) + 0 = f(x), so what is the point of "adding nothing"? Since I am stuck here, I can't tell why Residual Networks are good for training deeper NNs.
Q1 is my most important question, but allow me to ask one more:
Q2: Why is it easy to learn the identity function? I understand that if I apply L2 regularization / weight decay to W^[l+2] and b^[l+2], then most likely they will get close to 0, so that a^[l+2] = a^[l]. But is that what makes it "easy"?
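(Writing out my understanding of the algebra from the lecture, assuming g is ReLU and the skip connection adds a^[l] before the activation:)

$$
a^{[l+2]} = g\left(z^{[l+2]} + a^{[l]}\right) = g\left(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]}\right)
$$

$$
W^{[l+2]} \approx 0,\; b^{[l+2]} \approx 0 \;\Rightarrow\; a^{[l+2]} \approx g\left(a^{[l]}\right) = a^{[l]} \quad \text{(ReLU and } a^{[l]} \ge 0\text{)}
$$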
The point is not to "learn the identity function". That is just a starting point that you get "for free" from the skip connection. You want to learn a function that is actually interesting and applicable to the problem at hand, right? But the problem is that with very deep networks, the learning can be hard to do. Having the identity function as a starting point (well, as one path in addition to the randomly initialized normal convolutional path) makes it easier for back propagation to find a good solution without getting sidetracked by vanishing and exploding gradients.
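If it helps to see it concretely, here is a minimal numpy sketch (just my own illustration, not the course code) of one residual block with ReLU activations. The point to notice is that when W2 and b2 are near zero (e.g. pushed there by weight decay, or simply small at initialization), the block's output is essentially its input, so the block starts out as (approximately) the identity:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(a_l, W1, b1, W2, b2):
    """One two-layer 'main' path plus a skip connection.

    a^[l+1] = g(W1 a^[l] + b1)
    a^[l+2] = g(W2 a^[l+1] + b2 + a^[l])   <- the skip adds a^[l] before the activation
    """
    a_l1 = relu(W1 @ a_l + b1)
    z_l2 = W2 @ a_l1 + b2
    return relu(z_l2 + a_l)              # "+ a_l" is the shortcut path

# Hypothetical sizes, just for the demo
n = 4
a_l = relu(np.random.randn(n, 1))        # a^[l] is itself a ReLU output, so >= 0

W1, b1 = np.random.randn(n, n) * 0.01, np.zeros((n, 1))
W2, b2 = np.zeros((n, n)), np.zeros((n, 1))   # main path weights driven ~0

a_l2 = residual_block(a_l, W1, b1, W2, b2)
print(np.allclose(a_l2, a_l))            # True: the block passes a^[l] straight through
```

With that as the baseline, gradient descent only has to learn how the residual (the W2 path) should differ from zero to improve the result, rather than having to recover a useful mapping from scratch through many nonlinear layers.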
I think you should listen again to what Prof Ng actually says about this. If my memory serves, he gives his own better and more complete version of what I said above in the lectures. Even if it may not seem intuitive to you, Residual Networks have been shown to work well in some important cases. That's why Prof Ng is telling you about them. Of course, as with everything here, there is no single "silver bullet" solution that works best in every possible case. A big part of what we are learning here is a lot of different techniques and how to decide which of them might be applicable to a given type of problem that we need to solve.