C4W4A2, VGG19 model input shape

When loading the model and weights (vgg19_weights_tf_dim_ordering_tf_kernels_notop), do we have to use the same input shape as specified in the assignment? What if I want to use the same model and weights but a different input shape, e.g., (400, 400, 3)? Since many images are not 1:1, it would be helpful if we could use the same weights for different input shapes.

It’s an interesting question. Any halfway decent image processing library provides resize as a function. So the first approach might be to resize your inputs to the size on which the network was trained. Of course if that involves a change of aspect ratio, it might affect the performance.
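To make the first approach concrete, here is a minimal sketch using Pillow (the random array stands in for a real photo, which is an assumption for illustration):

```python
import numpy as np
from PIL import Image

# Stand-in for a real photo: a random 300 x 400 RGB image (H x W).
arr = (np.random.rand(300, 400, 3) * 255).astype("uint8")
img = Image.fromarray(arr)

# Squash to VGG-19's 224 x 224 input; the aspect ratio changes
# from 4:3 to 1:1, which may affect performance.
img_224 = img.resize((224, 224))
print(img_224.size)  # (224, 224) -- PIL reports (width, height)
```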

But ConvNets are a bit more flexible than the fully connected nets we learned about in DLS Course 1 and Course 2: with FC nets you have no choice but to resize the inputs to the size on which the network was trained. With a ConvNet, the trained filters can still be applied even if the input size is different, as long as it's still an RGB image (or whatever image type was used in training).

But you have to look at the network architecture. E.g., if at some point down the pipeline it flattens into an FC layer, then the number of elements needs to be the correct size by the time you get there. If the input images are a different size, that may not work. I have not looked at the summary of all the layers in VGG-19, but if you want to play this game it would be worth a look.

Of course you could also treat that situation as a case for “transfer learning”: use the existing network up to the point at which it does the flatten, remove those layers, supply your own last few layers and classifier layer, and retrain with the earlier part of the net frozen (or not).
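To see why the flatten is where the mismatch bites, note that VGG-19's conv layers use 'same' padding, so only its five 2×2 max-pools change the spatial size. A quick sketch of the arithmetic (the helper name is mine, not from the library):

```python
# VGG-19's conv layers preserve spatial size ('same' padding), so only
# the five 2x2 max-pools shrink it -- each one floor-halves the side.
def vgg19_feature_map_side(input_side, num_pools=5):
    side = input_side
    for _ in range(num_pools):
        side //= 2
    return side

print(vgg19_feature_map_side(224))  # 7  -> flatten sees 7*7*512 = 25088 units
print(vgg19_feature_map_side(400))  # 12 -> a different flatten size
```

So a 400×400 input produces a 12×12×512 feature map instead of 7×7×512, and the pretrained FC weights no longer fit; the conv filters themselves are unaffected.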

If you end up trying any of this, it would be great if you could share any useful things that you learn in the process.


Thank you for the response. I’m not sure how transfer learning works in this case, where the input aspect ratio is different from the pretrained model’s. Are you saying that we may only change, say, the first conv block’s hyperparameters (kernel size, stride, etc.) to make its output shape match the output of the first conv block of the pretrained model? That could mean significant work (digging into the original training set and redoing the training) just to make it work for images that are not 1:1.

Reading the original paper of VGG-19, I found the following description of the training image size:
“Let S be the smallest side of an isotropically-rescaled training image, from which the ConvNet input is cropped (we also refer to S as the training scale). While the crop size is fixed to 224 × 224, in principle S can take on any value not less than 224: for S = 224 the crop will capture whole-image statistics, completely spanning the smallest side of a training image; for S ≫ 224 the crop will correspond to a small part of the image, containing a small object or an object part.”

IIUC, it basically crops a square the size of the smallest side for small images, or crops a smaller portion of the image if it’s large.
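A small sketch of the paper’s scheme as I read it (isotropic rescale so the smallest side is S, then a 224×224 crop; the paper uses random crops during training, so the center crop here is a simplification of mine):

```python
import numpy as np
from PIL import Image

def rescale_and_crop(img, S=256, crop=224):
    # Isotropically rescale so the smallest side equals S ...
    w, h = img.size
    scale = S / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)))
    # ... then take a 224 x 224 crop (center crop for simplicity).
    w, h = img.size
    left, top = (w - crop) // 2, (h - crop) // 2
    return img.crop((left, top, left + crop, top + crop))

# Stand-in for a real 500 x 300 photo.
arr = (np.random.rand(300, 500, 3) * 255).astype("uint8")
out = rescale_and_crop(Image.fromarray(arr))
print(out.size)  # (224, 224)
```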

Similarly for style transfer, we could do this:

  1. For the style image, we crop a square of size (S, S) using the smallest side. In other words, we only take part of the style image and use it to generate the image based on the content image. Hopefully, the (S, S) image captures the “style” information from the style image.
  2. For the content image, we expand the image to a square along the longest side by padding with zeros. After all the processing described in the assignment, we will generate a 1:1 image. We then crop away the padded area to make it the same size as the original image.
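The two steps above can be sketched in pure NumPy (H × W × C arrays; the helper names are mine):

```python
import numpy as np

def square_crop(img):
    """Step 1: center-crop to (S, S), where S is the smallest side."""
    h, w, _ = img.shape
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2
    return img[top:top + s, left:left + s]

def square_pad(img):
    """Step 2: zero-pad the shorter side up to the longest side.
    Also returns the placement box, so the padding can be cropped
    back off the generated image afterwards."""
    h, w, c = img.shape
    s = max(h, w)
    out = np.zeros((s, s, c), dtype=img.dtype)
    top, left = (s - h) // 2, (s - w) // 2
    out[top:top + h, left:left + w] = img
    return out, (top, left, h, w)

style = np.ones((300, 500, 3))
content = np.ones((300, 500, 3))
cropped = square_crop(style)
padded, box = square_pad(content)
print(cropped.shape)  # (300, 300, 3)
print(padded.shape)   # (500, 500, 3)
```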

I will try this method and see how it works. Will come back to this thread later.

When I was talking about transfer learning, I did not mean to change any of the hyperparameters of the first conv layer, since that would require retraining the whole network. My point was to leave the “front end” of the network alone and just remove the portion starting with the first flattened FC layer, since that’s where the mismatch of the size will finally hit you. Then you retrain only the final few layers, including the output classification layer. This is what we did in the Transfer Learning with MobileNet exercise, although there the problem was not a change of image size but rather looking for something different in the image.
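A minimal Keras sketch of that idea, assuming TF/Keras is available (weights=None here only to avoid a download; in practice you would pass weights='imagenet' or the local notop .h5 file, and the 10-class head is a made-up example):

```python
import tensorflow as tf

# Keep the conv "front end" as-is; include_top=False drops the
# flatten + FC layers that are tied to 224 x 224 inputs.
base = tf.keras.applications.VGG19(include_top=False, weights=None,
                                   input_shape=(400, 400, 3))
base.trainable = False  # freeze the pretrained part

# Supply your own last few layers and classifier, then retrain only these.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),  # e.g. 10 classes
])
print(model.output_shape)  # (None, 10)
```

Note GlobalAveragePooling2D (rather than Flatten) makes the head independent of the spatial size of the feature map, which is exactly what the different input shape changes.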

I have not read the VGG paper, so I don’t understand the role of cropping in the verbiage you quote. Is it only the “crop” that they actually feed through the network? If so, then maybe none of this matters as long as you leave the crop size the same. That sounds too simple, so my interpretation must not be what is happening.
