Backpropagation in ResNets

It’s a legitimate question, but note that we have now graduated to doing everything in TensorFlow. One important side effect of using a platform like TF (or PyTorch, Caffe, or …) is that the mechanics of backpropagation and gradients are handled for us by the platform “under the covers”. You can still imagine what must happen in principle: the point of Residual Networks is that the shortcut path gives gradients a direct route around each block, which helps them flow through deep networks. The forward-propagation graph is therefore no longer a simple chain — it contains parallel paths. The flow of gradients during backpropagation must mirror the forward structure, of course: at the node where each shortcut path diverges, two independent gradients feed backwards into that node, and they need to be combined. By the multivariable chain rule, they are combined by summation, not averaging. The notebook gives the reference to the original paper that defined all this; if you want to probe more deeply, have a look and see whether the authors say anything about how backpropagation works in this architecture.
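To make the gradient combination concrete, here is a minimal sketch in plain Python (no framework; the function names and the toy main path f(x) = w·x are my own illustrative choices, not anything from the notebook). It shows that for a residual block y = f(x) + x, the gradient arriving at x is the sum of the gradients from the main path and the shortcut path:

```python
# Toy residual block: forward is y = f(x) + x, with f(x) = w * x.
# By the multivariable chain rule, the gradient at the branch point
# is the SUM of the gradients flowing back along each path.

def residual_forward(x, w):
    main = w * x       # main (weighted) path
    shortcut = x       # identity shortcut path
    return main + shortcut

def residual_backward(x, w, upstream_grad):
    # Gradient through the main path: d(w*x)/dx = w
    grad_main = upstream_grad * w
    # Gradient through the identity shortcut: dx/dx = 1
    grad_shortcut = upstream_grad * 1.0
    # The two path gradients are summed where the paths rejoin x
    return grad_main + grad_shortcut

x, w = 3.0, 2.0
print(residual_forward(x, w))          # y = 2*3 + 3 = 9.0
print(residual_backward(x, w, 1.0))    # dy/dx = w + 1 = 3.0
```

Note that the `+ 1` contribution from the identity path is exactly why gradients in ResNets cannot vanish entirely through a block: even if the main path’s gradient shrinks, the shortcut still passes the upstream gradient through unchanged.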