How Conv2DTranspose works

Conv2DTranspose (which I will subsequently call Conv2T) came up a number of times while explaining different image segmentation architectures. At first I did not really understand how it worked and only glossed over it. But after going through the model summary, I was able to figure out how Conv2T works, and I will explain it using the FCN-8 decoder architecture and code.

In order to understand how Conv2T works, you should be familiar with how a typical Conv2D works, because Conv2T tries to reverse the effect of Conv2D.

Leaving out the channel dimension of an input feature (whether it is an image or an intermediate feature map), Conv2D reduces the width and height of its input according to the following formula.

```
output = ((input - kernel_size) / strides) + 1
```

where input is either the width or height of the input, and output is the corresponding width or height after performing the convolution. Note that this assumes there is no padding (and that the division works out evenly), in order to simplify things.

With this formula in mind, suppose the input is (7, 7), the kernel size is (3, 3), and the stride is (2, 2). Our final output will be:

output = ((7 - 3) / 2) + 1
output = 3 (the output feature will be (3, 3))
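
As a quick sanity check, here is a minimal Keras snippet (my own illustration, not course code) that confirms these numbers:

```python
import tensorflow as tf

# (7, 7) single-channel input, kernel (3, 3), stride (2, 2), no padding
x = tf.random.normal((1, 7, 7, 1))  # (batch, height, width, channels)
conv = tf.keras.layers.Conv2D(filters=1, kernel_size=(3, 3),
                              strides=(2, 2), padding="valid")
print(conv(x).shape)  # (1, 3, 3, 1)  ->  ((7 - 3) / 2) + 1 = 3
```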

NOW, to Conv2T. Conv2T simply reverses the formula.

To avoid confusion, let us make input the subject of the convolution formula above.
Thus,

input = (output - 1) * strides + kernel_size

However, since we are reversing the process, the input of the actual Conv2D becomes the output of Conv2T, and the output of the Conv2D becomes the input of Conv2T.

Therefore, the formula for the reversal performed by Conv2T is:

output = (input - 1) * strides + kernel_size

Let us try this formula with the Conv2D example above.
Our input is (3, 3), the kernel size is (3, 3), and the stride is (2, 2).

output = ((3 - 1) * 2) + 3
output = 7 (Which is the original size of the image we wanted)
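
The same kind of sanity check (again, just my own illustration) shows Conv2T reversing those shapes:

```python
import tensorflow as tf

# (3, 3) input, kernel (3, 3), stride (2, 2), no padding
x = tf.random.normal((1, 3, 3, 1))
conv_t = tf.keras.layers.Conv2DTranspose(filters=1, kernel_size=(3, 3),
                                         strides=(2, 2), padding="valid")
print(conv_t(x).shape)  # (1, 7, 7, 1)  ->  (3 - 1) * 2 + 3 = 7
```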

From what I have seen so far, it is worth noting that Conv2T is not a perfect way to reverse the process: it only reverses the shape change, not the original values, but that is enough for upsampling.

Let us demonstrate how this works with the FCN-8 architecture.

In the typical FCN-8 encoder that we used (VGG-16 with some additional layers), we saw that the image goes from 224 to 112 (p1), to 56 (p2), to 28 (p3), to 14 (p4), and finally to 7 (p5). See part of the model summary below, with the relevant outputs noted:

```
Layer (type)                 Output Shape           Param #     Connected to
input_1 (InputLayer)         [(None, 224, 224, 3)]  0
block1_conv1 (Conv2D)        (None, 224, 224, 64)   1792        ['input_1[0][0]']
block1_conv2 (Conv2D)        (None, 224, 224, 64)   36928       ['block1_conv1[0][0]']
block1_pool2 (MaxPooling2D)  (None, 112, 112, 64)   0           ['block1_conv2[0][0]']
block2_conv1 (Conv2D)        (None, 112, 112, 128)  73856       ['block1_pool2[0][0]']
block2_conv2 (Conv2D)        (None, 112, 112, 128)  147584      ['block2_conv1[0][0]']
block2_pool2 (MaxPooling2D)  (None, 56, 56, 128)    0           ['block2_conv2[0][0]']
block3_conv1 (Conv2D)        (None, 56, 56, 256)    295168      ['block2_pool2[0][0]']
block3_conv2 (Conv2D)        (None, 56, 56, 256)    590080      ['block3_conv1[0][0]']
block3_conv3 (Conv2D)        (None, 56, 56, 256)    590080      ['block3_conv2[0][0]']
block3_pool3 (MaxPooling2D)  (None, 28, 28, 256)    0           ['block3_conv3[0][0]']
block4_conv1 (Conv2D)        (None, 28, 28, 512)    1180160     ['block3_pool3[0][0]']
block4_conv2 (Conv2D)        (None, 28, 28, 512)    2359808     ['block4_conv1[0][0]']
block4_conv3 (Conv2D)        (None, 28, 28, 512)    2359808     ['block4_conv2[0][0]']
block4_pool3 (MaxPooling2D)  (None, 14, 14, 512)    0           ['block4_conv3[0][0]']
block5_conv1 (Conv2D)        (None, 14, 14, 512)    2359808     ['block4_pool3[0][0]']
block5_conv2 (Conv2D)        (None, 14, 14, 512)    2359808     ['block5_conv1[0][0]']
block5_conv3 (Conv2D)        (None, 14, 14, 512)    2359808     ['block5_conv2[0][0]']
block5_pool3 (MaxPooling2D)  (None, 7, 7, 512)      0           ['block5_conv3[0][0]']
conv6 (Conv2D)               (None, 7, 7, 4096)     102764544   ['block5_pool3[0][0]']
conv7 (Conv2D)               (None, 7, 7, 4096)     16781312    ['conv6[0][0]']
```

Also note that it is the MaxPooling layers that reduce the width and height here. MaxPooling is not itself a convolution layer, but it slides a window over the input in the same way, so its output size follows the same formula.
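
If you want to verify that pooling follows the same formula, here is a small check (my own snippet, not course code):

```python
import tensorflow as tf

# pool_size (2, 2); strides default to the pool size, padding defaults to "valid"
x = tf.random.normal((1, 224, 224, 64))
pool = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))
print(pool(x).shape)  # (1, 112, 112, 64)  ->  ((224 - 2) / 2) + 1 = 112
```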

From the FCN-8 decoder, we know that we first have to upsample p5 by 2x, meaning the height and width are doubled, so our desired output is 14 x 14.

From the code:

```
tf.keras.layers.Conv2DTranspose(n_classes, kernel_size=(4,4), strides=(2,2), use_bias=False)
```

We see that we used a kernel_size of 4 and strides of 2.

From the Conv2T formula:

output = (input - 1) * strides + kernel_size

The output of this will be (7 - 1) * 2 + 4, which equals 16. But our desired output is 14. This is why a cropping layer (tf.keras.layers.Cropping2D(cropping=(1,1))) follows this layer: it crops one pixel from each edge (top, bottom, left, and right), reducing both the height and the width by 2. Our final output becomes 14.
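
You can check these numbers directly. This is a sketch, not the assignment code; the n_classes value of 12 is simply what appears in the decoder summary further below:

```python
import tensorflow as tf

n_classes = 12
x = tf.random.normal((1, 7, 7, 4096))  # stand-in for the 7 x 7 conv7 (p5-level) features
up = tf.keras.layers.Conv2DTranspose(n_classes, kernel_size=(4, 4),
                                     strides=(2, 2), use_bias=False)(x)
print(up.shape)  # (1, 16, 16, 12)  ->  (7 - 1) * 2 + 4 = 16
cropped = tf.keras.layers.Cropping2D(cropping=(1, 1))(up)
print(cropped.shape)  # (1, 14, 14, 12)  ->  one pixel trimmed from each edge
```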

This output is then combined with p4, which has the same height and width as the upsampled p5.

This 2x upsampling with Conv2T (followed by the same cropping) is also applied to the combination of p4 and the upsampled p5, giving an output height and width of 28, which is then combined with p3.

This combined result is then upsampled 8x using the following settings: kernel_size=(8,8), strides=(8,8).

Plugging this into our Conv2T formula, we get:

(28 - 1) * 8 + 8, which equals 224.

This gives us the final result: an output upsampled to the same height and width as the input image.
Notice how this particular layer is not followed by a cropping layer, because the formula lands exactly on 224.
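
Again, a quick shape check of this final step (my own snippet, with n_classes=12 as in the summary below):

```python
import tensorflow as tf

n_classes = 12
x = tf.random.normal((1, 28, 28, n_classes))  # the combined 28 x 28 features
up8 = tf.keras.layers.Conv2DTranspose(n_classes, kernel_size=(8, 8),
                                      strides=(8, 8), use_bias=False)(x)
print(up8.shape)  # (1, 224, 224, 12)  ->  (28 - 1) * 8 + 8 = 224
```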

See the summary of the decoder part below, with the major upsampling layers shown:

```
Layer (type)                          Output Shape          Param #   Connected to
conv2d_transpose (Conv2DTranspose)    (None, 16, 16, 12)    786432    ['conv7[0][0]']
cropping2d (Cropping2D)               (None, 14, 14, 12)    0         ['conv2d_transpose[0][0]']
conv2d (Conv2D)                       (None, 14, 14, 12)    6156      ['block4_pool3[0][0]']
add (Add)                             (None, 14, 14, 12)    0         ['cropping2d[0][0]', 'conv2d[0][0]']
conv2d_transpose_1 (Conv2DTranspose)  (None, 30, 30, 12)    2304      ['add[0][0]']
cropping2d_1 (Cropping2D)             (None, 28, 28, 12)    0         ['conv2d_transpose_1[0][0]']
conv2d_1 (Conv2D)                     (None, 28, 28, 12)    3084      ['block3_pool3[0][0]']
add_1 (Add)                           (None, 28, 28, 12)    0         ['cropping2d_1[0][0]', 'conv2d_1[0][0]']
conv2d_transpose_2 (Conv2DTranspose)  (None, 224, 224, 12)  9216      ['add_1[0][0]']
```
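
To tie the whole thing together, here is a minimal sketch of how these decoder layers could be wired up. The names conv7, p4, p3, the 1x1 Conv2D projections, and the default n_classes=12 are my own choices based on the summary above, not the exact assignment code:

```python
import tensorflow as tf

def fcn8_decoder_sketch(conv7, p4, p3, n_classes=12):
    """Illustrative FCN-8 decoder: conv7 is 7x7, p4 is 14x14, p3 is 28x28."""
    # 2x upsample conv7: 7 -> 16, then crop to 14
    o = tf.keras.layers.Conv2DTranspose(n_classes, kernel_size=(4, 4),
                                        strides=(2, 2), use_bias=False)(conv7)
    o = tf.keras.layers.Cropping2D(cropping=(1, 1))(o)

    # project p4 to n_classes channels and add the skip connection
    p4_proj = tf.keras.layers.Conv2D(n_classes, kernel_size=(1, 1))(p4)
    o = tf.keras.layers.Add()([o, p4_proj])

    # 2x upsample again: 14 -> 30, crop to 28, then add the p3 skip connection
    o = tf.keras.layers.Conv2DTranspose(n_classes, kernel_size=(4, 4),
                                        strides=(2, 2), use_bias=False)(o)
    o = tf.keras.layers.Cropping2D(cropping=(1, 1))(o)
    p3_proj = tf.keras.layers.Conv2D(n_classes, kernel_size=(1, 1))(p3)
    o = tf.keras.layers.Add()([o, p3_proj])

    # final 8x upsample: 28 -> 224, no cropping needed
    return tf.keras.layers.Conv2DTranspose(n_classes, kernel_size=(8, 8),
                                           strides=(8, 8), use_bias=False)(o)

# quick shape check with random stand-ins for the encoder outputs
conv7 = tf.random.normal((1, 7, 7, 4096))
p4 = tf.random.normal((1, 14, 14, 512))
p3 = tf.random.normal((1, 28, 28, 256))
print(fcn8_decoder_sketch(conv7, p4, p3).shape)  # (1, 224, 224, 12)
```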

Thank you. I am open to any contribution to this post.

This post is related to the tf-at course 3 week 3 assignment, the FCN-8 section, right? The one where we need to code the 5 block layers?

Yes, it is. @Deepti_Prasad

From your post, what I understood is that we need to use this code for the FCN-8 section:

tf.keras.layers.Conv2DTranspose(n_classes, kernel_size=(4,4), strides=(2,2), use_bias=False) for the different blocks (1-5).

Am I right??

@Deepti_Prasad, I am not sure what the assignment looks like now. But if I am correct, you will be using Conv2DTranspose to upsample your feature maps when using the FCN-8 architecture.

The post I made is not meant to tell you exactly which code to use. It is more to explain how Conv2DTranspose works, and how to upsample images by 2x, 3x, 4x, and to other sizes.

You can ask further if you need more clarity.

No, I was asking about the conv_block for the image_ordering encoder code. I solved the issue. The ungraded lab helped me understand.


David, I do have a doubt!
Why do we need to use a cropping layer to crop the image, when max pooling seems to do the same thing?

As I understand it, max pooling's function is to reduce or downsample the dimensionality of the input image, while the Cropping layer is described as a cropping layer for 3D data (e.g. spatial or spatio-temporal).

Can't max pooling take on the added function of cropping, so that we can get rid of the cropping layer in the model architecture?

Just a doubt?

Thank you
DP

Using the Cropping layer is different from using max pooling. We use the Cropping layer to get rid of excess pixels, for instance reducing a 30 x 30 feature map to 28 x 28. This just removes the edges of the 30 x 30 map.

MaxPooling, on the other hand, works more like the kernel of a convolution. For instance, if we use a 2 x 2 MaxPooling layer on a 30 x 30 input, we get a single pixel back for each 2 x 2 window we slide over the input.
With the default stride of 2, that gives a 15 x 15 result instead of the 28 x 28 that we want.
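
Here is a small side-by-side check (my own snippet, not from the assignment) of what each layer does to the same 30 x 30 input:

```python
import tensorflow as tf

x = tf.random.normal((1, 30, 30, 12))

cropped = tf.keras.layers.Cropping2D(cropping=(1, 1))(x)    # trims 1 pixel from every edge
pooled = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(x)  # max over each 2x2 window

print(cropped.shape)  # (1, 28, 28, 12) -- only the border is removed
print(pooled.shape)   # (1, 15, 15, 12) -- height and width are halved
```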

So basically you are stating that cropping reduces the dimensions by less than max pooling does, and that is why we need a separate cropping layer.

I would say that cropping removes excess pixels, usually from the edge. It is useful during upsampling, when Conv2DTranspose gives us a result that is larger than we need. For example, instead of the 28 x 28 output desired at a given point, Conv2DTranspose could generate a 30 x 30 result. That is an excess of 2 pixels in each dimension, which typically needs to be removed FROM THE EDGE of the 30 x 30 result, hence the need to crop.

On the other hand, pooling (MaxPooling or AveragePooling) slides a window over the result and reduces every window to a single value as it goes. For example, a 2 x 2 max pooling first covers the first 2 x 2 pixels of the result, then slides to the next 2 x 2 window within the same result.

What the two do is quite different, and it is not so much about how much each of them shrinks an image or feature matrix.
