Encoder block dimensions

The encoder of the Transformer model is a stack of identical encoder blocks. However, I am wondering whether the output of one encoder block matches the input expected by the next encoder block.

encoder_block = [
    # `Residual` layer wrapping:
    #     norm layer
    #     attention
    #     dropout
    # another `Residual` layer wrapping:
    #     feed forward
]

Since the encoder block starts with an attention layer, I think its input should be something like (batch_size, n_seq, n_heads, d_model); and since the last layer of an encoder block is a feed-forward layer, I think its output should have shape (batch_size, n_seq, d_model). If my understanding is correct, how do the dimensions fit the next encoder block?



The number of heads is a hyperparameter and is not part of the attention layer's input shape. Inside the layer, the d_model dimension is split into n_heads slices of size d_model / n_heads; each head attends over its own slice, and the per-head outputs are concatenated back to d_model at the output. So the number of heads does not affect the input or output dimensions: both are (batch_size, n_seq, d_model).
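To make the split-and-reconcatenate point concrete, here is a minimal NumPy sketch of multi-head self-attention with random weights (the function and variable names are made up for illustration). Note that the input is (batch, seq, d_model) with no head axis; the heads only exist inside the layer:

```python
import numpy as np

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, n_heads):
    # x: (batch, seq, d_model); all weight matrices: (d_model, d_model)
    batch, seq, d_model = x.shape
    d_k = d_model // n_heads  # each head works on a d_model / n_heads slice

    def split(t):  # (batch, seq, d_model) -> (batch, n_heads, seq, d_k)
        return t.reshape(batch, seq, n_heads, d_k).transpose(0, 2, 1, 3)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_k)   # (batch, n_heads, seq, seq)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)             # softmax over the key axis
    heads = weights @ v                                   # (batch, n_heads, seq, d_k)
    # concatenate the heads back together: (batch, seq, d_model)
    concat = heads.transpose(0, 2, 1, 3).reshape(batch, seq, d_model)
    return concat @ w_o

rng = np.random.default_rng(0)
d_model, n_heads = 64, 8
x = rng.standard_normal((2, 10, d_model))
w = [rng.standard_normal((d_model, d_model)) for _ in range(4)]
out = multi_head_self_attention(x, *w, n_heads)
print(out.shape)  # (2, 10, 64) -- same shape as the input
```

Whatever value you pick for n_heads (as long as it divides d_model), the output shape is unchanged.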

So does that mean the input and output share the same dimensions for every encoder and decoder block?

That’s how I understand it