Hello @ZHONG_Yiyuan, and @paulinpaloalto,

I think neither Andrew’s result nor Paul’s formula is wrong. There is some ambiguity built into this. Look at the formula for computing the output size of a normal convolution:

n_{out} = floor( (n_{in} + 2 * padding - kernel_size) / stride ) + 1

The floor operation makes the following situation possible: different input sizes can give the same output size. Now, if we are to design the transposed convolution operation, and we give it an input image of size (2 x 2), should the operation return a (3 x 3) or a (4 x 4) matrix?

**That is the ambiguity.** There are two possibilities in my example above.

In short, the n_{out} = 3 that you calculated using Paul’s formula, and the n_{out} = 4 that you see in Andrew’s video, are two of the possibilities. You will find this makes more sense if you do a normal convolution with a (3 x 3) input, and another normal convolution with a (4 x 4) input, both using a (3 x 3) kernel, stride = 2, padding = 1: both result in a (2 x 2) output. **Again, that is the ambiguity.**
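You can check this collapse yourself with the normal-convolution output-size formula (a few lines of plain Python, just for verification):

```python
from math import floor

def conv_out(n_in, kernel_size=3, stride=2, padding=1):
    # standard output-size formula for a normal convolution
    return floor((n_in + 2 * padding - kernel_size) / stride) + 1

# both a 3x3 and a 4x4 input collapse to a 2x2 output
print(conv_out(3), conv_out(4))  # -> 2 2
```

Because the forward direction is many-to-one, the transposed direction cannot be one-to-one: given a (2 x 2) input, both 3 and 4 are valid answers.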

Now the question becomes: how do we address this in an implementation of the transposed convolution? PyTorch uses the same equation as Paul’s, as far as only stride, padding, kernel size, and image size are concerned. TensorFlow has different equations depending on how you parameterize it.
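For example, PyTorch’s `torch.nn.ConvTranspose2d` documents its output size as `(n_in - 1) * stride - 2 * padding + dilation * (kernel_size - 1) + output_padding + 1`; with the default `dilation = 1` and `output_padding = 0` this reduces to Paul’s formula, and the extra `output_padding` argument is exactly the knob that lets you pick the other possibility (a sketch of the documented formula, no torch needed):

```python
def convtranspose_out(n_in, kernel_size, stride, padding,
                      output_padding=0, dilation=1):
    # output-size formula documented for torch.nn.ConvTranspose2d
    return ((n_in - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)

# defaults reproduce Paul's n_out = 3; output_padding=1 selects 4 instead
print(convtranspose_out(2, 3, 2, 1))     # -> 3
print(convtranspose_out(2, 3, 2, 1, 1))  # -> 4
```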

Therefore, Andrew’s implementation results in what you see in the video, and you can implement your own that results in the shape you described in your first post. However, what is unchanged is that we follow the steps described by Andrew: do all those element-wise multiplications, then place the results in the right positions in an output matrix whose shape is pre-computed (by a formula like Paul’s, PyTorch’s, TensorFlow’s, Andrew’s, or yours).
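Those steps can be sketched in plain Python (a toy single-channel version I wrote for illustration, not Andrew’s or any library’s actual code): each input element scales the whole kernel, the scaled kernels are added into an over-sized canvas at stride-spaced positions, and finally the padding border is cropped off to reach the pre-computed output shape.

```python
def conv_transpose2d(x, k, stride=2, pad=1):
    """Toy transposed convolution on square 2D lists (single channel)."""
    n, m = len(x), len(k)
    full = (n - 1) * stride + m          # canvas size before cropping padding
    y = [[0.0] * full for _ in range(full)]
    for i in range(n):
        for j in range(n):
            for a in range(m):
                for b in range(m):
                    # scale the kernel by x[i][j], add it at the strided offset
                    y[i * stride + a][j * stride + b] += x[i][j] * k[a][b]
    # crop `pad` rows/columns from each side, matching Paul's formula:
    # n_out = (n - 1) * stride - 2 * pad + m
    return [row[pad:full - pad] for row in y[pad:full - pad]]

x = [[1.0, 2.0], [3.0, 4.0]]
k = [[1.0] * 3 for _ in range(3)]
out = conv_transpose2d(x, k, stride=2, pad=1)
print(out)  # -> [[1.0, 3.0, 2.0], [4.0, 10.0, 6.0], [3.0, 7.0, 4.0]]
```

With a (2 x 2) input, a (3 x 3) kernel, stride = 2, padding = 1, this produces the (3 x 3) result from Paul’s formula; to get Andrew’s (4 x 4) you would pre-compute a larger output shape and crop (or pad) accordingly.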

Raymond