The first paragraphs of the assignment contradict the ResNet paper:
“We argue that this optimization difficulty is unlikely to be caused by vanishing gradients. These plain networks are trained with BN (Batch Normalization), which ensures forward propagated signals to have non-zero variances. We also verify that the backward propagated gradients exhibit healthy norms with BN. So neither forward nor backward signals vanish.”
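The quoted claim about forward signals can be sketched numerically. The toy NumPy simulation below (my own illustration, not an experiment from the paper) pushes a batch through a deep stack of linear + ReLU layers with a deliberately poor weight scale: without normalization the activation variance collapses toward zero, while re-standardizing each layer (the core of what BN does at training time) keeps it on the order of 1.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 64))  # a batch of 1000 examples, 64 features
depth = 30

def forward(x, use_bn):
    h = x
    for _ in range(depth):
        # deliberately small weight scale to provoke vanishing activations
        W = rng.normal(scale=0.05, size=(64, 64))
        h = h @ W
        if use_bn:
            # batch-norm-style re-standardization of each unit over the batch
            h = (h - h.mean(axis=0)) / (h.std(axis=0) + 1e-5)
        h = np.maximum(h, 0.0)  # ReLU
    return h

print(forward(x, use_bn=False).std())  # collapses toward zero
print(forward(x, use_bn=True).std())   # stays on the order of 1
```

This is exactly why the paper can rule out vanishing forward signals for its BN-trained plain networks: the normalization re-inflates the variance at every layer regardless of the weight scale.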
So if the problem were vanishing/exploding gradients, it would be more logical to use better initialization methods (as Prof. Andrew mentions) together with Batch Normalization.
But the problem ResNet solves is something else: the degradation problem, where deeper plain networks get higher training error than shallower ones even though gradients are healthy.
So, in this view, the first paragraphs of the assignment somewhat contradict the source paper.
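The paper's own explanation for why shortcuts help with that other problem is that they make the identity mapping trivial to represent: the block computes y = x + F(x), so the solver only has to push the residual F toward zero instead of learning an identity through a stack of nonlinear layers. A minimal NumPy sketch of that idea (hypothetical weight shapes, not the paper's architecture):

```python
import numpy as np

def residual_block(x, W1, W2):
    # y = x + F(x): the shortcut carries the input around the weighted layers
    h = np.maximum(x @ W1, 0.0)  # inner ReLU layer of the residual branch F
    return x + h @ W2

x = np.random.default_rng(1).normal(size=(4, 8))
W1 = np.zeros((8, 8))
W2 = np.zeros((8, 8))
# with the residual weights at zero, the block is exactly the identity,
# something a plain (non-residual) stack would have to learn explicitly
print(np.allclose(residual_block(x, W1, W2), x))  # True
```

This is the sense in which ResNet targets an optimization difficulty distinct from vanishing gradients.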