Course 4, week 3, programming assignment 2: transpose convolution implementation

Hi all,

I just completed the upsampling_block() function in the programming exercise correctly - but I have trouble understanding the implementation of the transpose convolution part. (I do understand the theory behind it, as explained in the lecture videos.)

So my main issue is with Conv2DTranspose(). In my understanding, this operation should increase the height and width of the image it takes as input, while also halving its depth.

The first time it is applied, following Figure 2, it will take an image of dimensions (8, 8, 1024) and output one of dimensions (16, 16, 512) (which is then concatenated to the output of the skip connection and given to the Conv2D layers - this part I understand).
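(For concreteness, here is a minimal shape check of what I mean - just a sketch, assuming the kernel size of 3 and stride of 2 that are given later in the notebook:)

```python
import tensorflow as tf

# Dummy input shaped like the first upsampling step: (batch, 8, 8, 1024)
x = tf.random.normal((1, 8, 8, 1024))
up = tf.keras.layers.Conv2DTranspose(filters=512, kernel_size=3,
                                     strides=2, padding="same")
print(up(x).shape)  # expected: (1, 16, 16, 512)
```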

Now, let’s look at the arguments Conv2DTranspose() takes:

  • filters: the number of filters used in the transpose convolution determines the depth of its output, which should be 512 when it is applied first. The same number of filters is also used for the following two Conv2D layers, during which the output depth should remain constant (i.e., 512). So this argument makes sense.
  • kernel_size: this is given, so all clear.
  • strides: this is given as well.
  • padding: we’re instructed to set this to ‘same’ - but I don’t understand why. A ‘same’ convolution keeps the h, w dimensions equal before and after the convolution operation - but they should be doubled here. Why then use ‘same’? The other option would be ‘valid’, but that would shrink the dimensions and also does not make sense. Or does ‘same’ in this case just mean to use a padding of 1 (see the following)?

Then, related to the padding argument, I have a bit of trouble understanding how, with the given kernel size and stride, an image of h, w (8, 8) can be transformed into one of h, w (16, 16) (I am leaving out the depth, which I do understand). In the lecture videos, a padding of 1 is used for the output. If I mindlessly follow this example and sketch it out, using a (3, 3) filter, a stride of 2 and a padding of 1, I indeed manage to transform the (8, 8) image into an (18, 18) image - whose padding, I assume, will be cropped, resulting in the desired (16, 16) image. Is this how it works - simply use a padding of 1 for the output? And is the padding size somehow determined by the filter size?
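(To make the question concrete, here is a quick throwaway comparison of the two padding options on a dummy tensor - not something from the assignment, just what I would expect based on the TF docs:)

```python
import tensorflow as tf

x = tf.random.normal((1, 8, 8, 1024))
for pad in ("same", "valid"):
    layer = tf.keras.layers.Conv2DTranspose(512, 3, strides=2, padding=pad)
    print(pad, layer(x).shape)
# 'same'  should give (1, 16, 16, 512): h, w = input * stride
# 'valid' should give (1, 17, 17, 512): h, w = (input - 1) * stride + kernel
```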

Thanks a lot for any clarifications!

I have not read all of your post in detail yet, but one important thing to realize is that the way TF handles “same” padding is a bit surprising: the padding is calculated such that you would get the same size if the stride were 1. But that padding value is then used with whatever the stride actually is, so the result will not be the same size unless the stride actually is 1. That’s how it works with normal convolutions, anyway. My guess is that the same is true with Transposed Convolutions. Does that change your question?

Thanks, Paul, for your answer.

I am afraid that I don’t understand at all what you mean, so I am not sure if it helps with my question. (In fact, the question was a bit verbose - writing out things helps me understand them - so I highlighted the parts that are actually questions.)

Could you explain a bit more, perhaps using that formula we saw in the lecture videos?

floor((n + 2p - f) / s) + 1

Doesn’t the padding always depend on the values of f and s?

Yes, of course the padding depends on the values of f and s in the general case. But my point was about how TF implements the concept of “same” padding. It does not actually give you the same output size, unless the stride is 1. In other words, it solves the size equation with s = 1 in order to determine p. Try it and watch what actually happens. My statement is about normal convolutions. I have not yet tried it with transposed convolutions, but I bet the same rule applies.
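If you want to see it concretely on a normal convolution (just a quick check with made-up numbers, not anything from the notebook): for f = 3, solving the size formula with s = 1 gives p = 1, and TF then uses that p with the real stride of 2, so an 8 x 8 input does not come out 8 x 8:

```python
import tensorflow as tf

x = tf.random.normal((1, 8, 8, 3))
conv = tf.keras.layers.Conv2D(16, 3, strides=2, padding="same")
print(conv(x).shape)  # (1, 4, 4, 16), not (1, 8, 8, 16)
# With p = 1 (the "same" padding for f = 3):
# floor((8 + 2*1 - 3) / 2) + 1 = floor(7/2) + 1 = 4
```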

OK, I understand what you are saying now. When the padding for a ‘same’ convolution is calculated, a stride of 1 is assumed - and using that stride would indeed make the output size equal to the input size, but actually using another stride results in a different size.

That’s indeed surprising, and it doesn’t immediately make sense to me. Isn’t the whole point of using a ‘same’ convolution to get the same input/output size, regardless of the stride? Or is there some kind of rule of thumb that a ‘same’ convolution usually goes with a stride of 1?

Anyway - let’s see if I got it all correct.

  1. A ‘same’ convolution always implies padding, but it only actually results in the same input/output size when you use a stride of 1. So it’s just that the name is a bit misleading.

  2. If I apply this new knowledge to the first transpose convolution in the upsampling_block() graded function, where the stride is given as 2, I can expect that the output size will not be the same as the input size - even though I use a ‘same’ convolution (i.e., set the padding argument to ‘same’).

Sketching this out, using the given filter size of 3, the given stride of 2, and a padding of 1 (chosen experimentally), I indeed manage to blow up an (8, 8) image to an (18, 18) image - from which, I assume, the padding edges will then be cropped, resulting in the desired (16, 16) image.

Did I get all that right now?