In the ResNet-50 model suggested for the assignment, Stage 2 is the only stage whose convolutional block uses stride = 1; in the other stages, the stride is always 2.
Yet when the stride is 1 in the first convolutional layer of the block, no convolution is needed on the shortcut path! Indeed, the dimensions of the block's first-layer input and last-layer output match.
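To make the stride argument concrete, here is a minimal sketch of the usual output-size arithmetic for a convolution. The helper name `conv_out_size` and the 56-pixel input are assumptions chosen for illustration; they are not taken from the assignment. The sketch only covers height and width. Whether the channel counts also match is a separate check that it does not perform.

```python
def conv_out_size(n, f, s, p=0):
    """Spatial output size of a convolution.

    n: input height (or width), f: filter size, s: stride, p: padding per side.
    """
    return (n + 2 * p - f) // s + 1


# 1x1 convolution, stride 1: height and width are unchanged.
print(conv_out_size(56, f=1, s=1))  # 56

# 1x1 convolution, stride 2: height and width are halved.
print(conv_out_size(56, f=1, s=2))  # 28
```

With stride 1 the main path keeps the spatial size of its input, which is the case discussed for Stage 2; with stride 2 it does not, so the shortcut has to be resized as well.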
So I think Stage 2 could be made of three identity blocks only; no convolutional block is needed. Does this make sense to you?
Looking at the paper that introduced ResNet-50, it does seem that the authors used an identity block in that stage:
In the paper, the authors run experiments on deep residual learning. In one of them, they use identity-mapping shortcuts to compare two architectures with the same number of parameters: the first is a plain network and the second is a residual network.
The plain network is inspired by VGG nets. The residual network follows the same pattern, with identity shortcuts inserted directly wherever the input and output have the same dimensions. When the dimensions increase, the shortcut either keeps the identity mapping with extra zero entries padded in, or uses 1x1 convolutions to match the dimensions.
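The rule described above can be sketched as a small decision function. This is only an illustration of the rule, not code from the paper or the assignment: the function name `choose_shortcut`, the option labels, and the example shapes are all assumptions made for this sketch.

```python
def choose_shortcut(in_shape, out_shape, option="projection"):
    """Pick a shortcut type for one residual block.

    in_shape, out_shape: (height, width, channels) of the block's input
    and of the main path's output.
    option: what to do when the shapes differ, either "zero_padding"
    or "projection" (a 1x1 convolution). Both labels are hypothetical.
    """
    if in_shape == out_shape:
        # Shapes agree, so the input can be added to the output as-is.
        return "identity"
    if option == "zero_padding":
        return "identity with zero padding"
    return "1x1 convolution"


# Illustrative shapes only.
print(choose_shortcut((56, 56, 256), (56, 56, 256)))
print(choose_shortcut((56, 56, 256), (28, 28, 512)))
print(choose_shortcut((56, 56, 256), (28, 28, 512), option="zero_padding"))
```

Note that the comparison covers the full shape, so a block whose main path changes only the channel count still falls into the second case.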
So the type of shortcut switches from block to block, depending on whether the dimensions change.
Later (as described in the paper), they evaluate the method on the ImageNet 2012 classification dataset [36], which consists of 1000 classes; the models are trained on the 1.28 million training images and evaluated on the 50k validation images.
Let me know if this makes your point clear. Thanks!