Hi all,
I just completed the `upsampling_block()` function in the programming exercise correctly - but I have trouble understanding the implementation of the transpose convolution part. (I do understand the theory behind it, as explained in the lecture videos.)
So my main issue is with `Conv2DTranspose()`. In my understanding, this operation should double the height and width of the image it takes as input, while also halving its depth. The first time it is applied, following Figure 2, it will take an image of dimensions (8, 8, 1024) and output one of dimensions (16, 16, 512) (which is then concatenated with the output of the skip connection and given to the Conv2D layers - this part I understand).
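To make sure I’m reading Figure 2 correctly, here is the quick shape check I ran (assuming a 3x3 kernel and a stride of 2, which I believe is what the exercise specifies):

```python
import tensorflow as tf

# One (8, 8, 1024) feature map, as in Figure 2 (with a batch dimension added).
x = tf.random.normal((1, 8, 8, 1024))

up = tf.keras.layers.Conv2DTranspose(filters=512, kernel_size=3,
                                     strides=2, padding='same')(x)
print(up.shape)  # (1, 16, 16, 512) -> h and w doubled, depth halved
```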
Now, let’s look at the arguments `Conv2DTranspose()` takes:
- `filters`: the number of filters used in the transpose convolution determines the depth of its output, which should be 512 when it is applied first. The same `filters` value is also used when applying the following two Conv2D layers - during which the output depth should remain constant (i.e., 512). So this argument makes sense.
- `kernel_size`: this is given, so all clear.
- `strides`: this is given as well.
- `padding`: we’re instructed to set this to ‘same’ - but I don’t understand why. A ‘same’ convolution keeps the h, w dimensions equal before and after the convolution operation - but here they should be doubled. Why then use ‘same’? The other option would be ‘valid’, but that would shrink the dimensions and also does not make sense. Or does ‘same’ in this case just mean to use a padding of 1 (see the following)?
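To see what ‘same’ actually does to the shape here, I ran the following comparison (again assuming a 3x3 kernel and a stride of 2):

```python
import tensorflow as tf

x = tf.random.normal((1, 8, 8, 1024))

same = tf.keras.layers.Conv2DTranspose(512, 3, strides=2, padding='same')(x)
valid = tf.keras.layers.Conv2DTranspose(512, 3, strides=2, padding='valid')(x)

print(same.shape)   # (1, 16, 16, 512) -> exactly input * stride
print(valid.shape)  # (1, 17, 17, 512) -> (8 - 1) * 2 + 3 = 17
```

This suggests that for a transposed convolution, ‘same’ means “make the output exactly stride times the input size” rather than “keep h, w unchanged”, while ‘valid’ keeps the full (8 - 1) * 2 + 3 = 17 footprint - is that the right way to read it?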
Then, related to the `padding` argument, I have a bit of trouble understanding how, with the given kernel size and stride, an image of h, w (8, 8) can be transformed into one of h, w (16, 16) (I am leaving out the depth, which I do understand). In the lecture videos, a padding of 1 is used for the output. If I mindlessly follow this example and sketch it out, using a (3, 3) filter, a stride of 2 and a padding of 1, I indeed manage to transform the (8, 8) image into an (18, 18) image - whose padding, I assume, will be cropped, resulting in the desired (16, 16) image. Is this how it works - simply use a padding of 1 for the output? And is the padding size somehow determined by the filter size?
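Here is roughly the same idea in code - a crude, single-channel version that places a scaled copy of the 3x3 kernel at every stride-2 position on a canvas and then crops down to (16, 16) (I’m assuming the surplus is cropped from the bottom/right):

```python
import numpy as np

def transpose_conv_same(x, kernel, stride=2):
    # Single-channel sketch of a transposed convolution with 'same' padding.
    i = x.shape[0]                   # input h/w, e.g. 8
    k = kernel.shape[0]              # kernel size, e.g. 3
    full = (i - 1) * stride + k      # full scatter footprint: 7*2 + 3 = 17
    canvas = np.zeros((full, full))
    # Place a copy of the kernel, scaled by the input pixel, at every
    # stride-spaced position; overlapping contributions add up.
    for r in range(i):
        for c in range(i):
            canvas[r*stride:r*stride + k, c*stride:c*stride + k] += x[r, c] * kernel
    # Keep exactly stride * i rows/columns (16 here); I'm assuming the
    # surplus row/column of the 17x17 canvas is cropped from the bottom/right.
    return canvas[:i * stride, :i * stride]

x = np.random.rand(8, 8)
kernel = np.random.rand(3, 3)
print(transpose_conv_same(x, kernel).shape)  # (16, 16)
```

Is this (roughly) what `Conv2DTranspose()` does under the hood with ‘same’ padding, or am I missing something about how the padding/cropping is chosen?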
Thanks a lot for any clarifications!