ResNets Question

Hi,

While I understand why Residual Nets might be beneficial, I don’t understand why we would get ourselves into a position where we need to use them.

If we are trying to skip some layers, why would we add these extra layers in the first place and then try to skip them? Also, why not simply remove some of the extra layers and make the network less deep? Does it have something to do with backprop?

Also, in the screenshot below, Professor Andrew shows that the terms in layer a[l+n] would cancel, so technically we end up computing g(a[l]).

As you know, a convolutional network tries to extract different characteristics from images (texts, and so on) and uses those characteristics (like “vertical line”, “edge”, “diagonal line”, …) to identify what it is looking at. A deeper network can extract more types of characteristics by applying several filters across multiple layers, which can potentially provide higher accuracy compared to a shallow network.
So, we want to use a deeper network. Then some problems appear, and the most critical one is that we cannot optimize it easily. One reason is, of course, computational power, so we need some balancing between “expected quality” and “computational resource requirements”. But the most critical point is a problem in the optimization algorithm itself.
As we learned, back-propagation is used to deliver “gradients” (partial derivatives) throughout the network. Starting from the very last layer, we successively calculate the gradients of each layer and back-propagate them. The problem is that these gradients can easily shrink to very small values, or even zero, in a deep network, because the gradient of an early layer is calculated by the chain rule through the partial derivatives of all the layers after it.
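Here is a minimal numerical sketch of that effect (my own illustration, not course code; the layer count, width, and the 0.5 scaling are made-up values chosen so that each layer’s Jacobian slightly shrinks the signal):

```python
# Sketch: how repeated chain-rule products can shrink gradients in a deep
# "plain" network. All sizes and scales here are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n_layers, width = 50, 64

# Pretend each layer's local Jacobian is a random matrix whose entries are
# a bit "too small" (e.g. small weights times saturated activations).
jacobians = [0.5 * rng.standard_normal((width, width)) / np.sqrt(width)
             for _ in range(n_layers)]

grad = np.ones(width)              # gradient arriving at the last layer
for l, J in enumerate(reversed(jacobians), start=1):
    grad = J.T @ grad              # chain rule: multiply by each layer's Jacobian
    if l % 10 == 0:
        print(f"{l} layers back: |grad| = {np.linalg.norm(grad):.2e}")
# The norm drops roughly geometrically -> "vanishing gradients".
```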
ResNet proposed an excellent way to avoid that. It carries the output of a particular layer forward and adds it to a later layer, so the signal is not lost. With this, we can back-propagate gradients from the very last layer all the way to the earliest layers of a deep network.
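As a concrete sketch of what such a skip connection can look like in code, here is a rough identity-style residual block in Keras. This is not the exact block from the programming assignment; the filter count, kernel size, and input shape are placeholders I picked for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    shortcut = x                                    # a[l], carried forward unchanged
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])                 # add a[l] back in before the final activation
    return layers.Activation("relu")(y)             # a[l+2] = g(z[l+2] + a[l])

inputs = tf.keras.Input(shape=(32, 32, 64))         # toy input; channels must match `filters`
outputs = residual_block(inputs)
model = tf.keras.Model(inputs, outputs)
```

During back-propagation, gradients flow back through both the convolutional path and directly through the Add, which is exactly the shortcut for gradients described above.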
Concerning your screenshot, Andrew talked about an advantage of this residual block in the case where the weights go to zero. As we learned, with L2 regularization, if \lambda is a large value, the updated weights can easily become small or zero. Once the weights go to zero (and Andrew also assumes the bias goes to zero), that layer effectively does nothing. But if we carry a residual connection to that point, even in that case we can keep the signal coming out of that layer. I suppose that is what Andrew wanted to say.
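For reference, this is roughly the calculation from that slide, written out (assuming the skip connection goes from layer l to layer l+2):

a[l+2] = g(z[l+2] + a[l]) = g(W[l+2] a[l+1] + b[l+2] + a[l])

If W[l+2] = 0 and b[l+2] = 0 (e.g. because of heavy L2 regularization), this reduces to a[l+2] = g(a[l]) = a[l] (for ReLU, since a[l] >= 0). So at worst the block learns the identity, and the signal from layer l is preserved.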

Hope this helps some…


In addition to Nobu’s detailed explanation, one other high-level point: the key thing to realize is that we’re not discarding the layers we skip, right? They are still there and provide input to the later layers. We’re just introducing an alternate, simpler parallel path for the data through the network. The skip connections split off and then rejoin at a later point, right? So you have two inputs at that junction, and also two paths going in the reverse direction for propagating gradients.

It may not be immediately intuitive why that would help, but it turns out that it does. It makes it easier to train such a deep network: the skip connections provide a “moderating” influence on the gradients during training that helps with convergence. Prof Ng explains why in the lectures, and Nobu has given you another wording of that explanation above.
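To make the “two paths in the reverse direction” point concrete, here is a rough sketch of the local derivative in the same notation as above (my own write-up, not something from the slides):

∂a[l+2]/∂a[l] = g'(z[l+2]) * ( W[l+2] * ∂a[l+1]/∂a[l] + I )

The identity term I comes from the skip connection. Even if the main-path term W[l+2] * ∂a[l+1]/∂a[l] becomes very small in a deep network, the gradient can still flow back through the shortcut, which is the second backward path mentioned above.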


Thanks Nobu and Paul!

The main misconception I had was that I thought we were discarding the layers.


My question here totally didn’t make sense.

Lesson learned: never watch the lectures late at night :)

Thank you - this was my impression as well.
