Encoder block dimensions

The encoder of the Transformer model is a stack of encoder blocks. However, I am wondering whether the output of one encoder block matches the input of the next encoder block.

encoder_block = [ 
    # add `Residual` layer
    tl.Residual(
        # add norm layer
        tl.LayerNorm(),
        # add attention
        attention,
        # add dropout
        dropout_,
    ),
    # add another `Residual` layer
    tl.Residual(
        # add feed forward
        feed_forward,
    ),
]

Since the encoder block starts with an attention layer, I think the input should be something like (batch_size, n_seq, n_heads, d_model); and since the last layer of an encoder block is a feed-forward layer, I think the output should be (batch_size, n_seq, d_model). If my understanding is correct, how does that output fit the input dimensions of the next encoder block?

Thanks

Hi YIHUI,

The number of heads is a hyperparameter and is not part of the input shape to the attention layer. Inside the layer, the d_model dimension is split across the heads (each head operates on d_model / n_heads features), and the per-head outputs are concatenated back together at the output. So the number of heads does not affect the input or output dimensions.
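For intuition, here is a minimal NumPy sketch of that split-and-concatenate bookkeeping (illustrative shapes only, not the course implementation): the layer receives (batch, n_seq, d_model) with no head dimension, the heads only exist internally, and the output has the same shape as the input.

import numpy as np

batch, n_seq, d_model, n_heads = 2, 10, 16, 4
d_head = d_model // n_heads                      # each head sees d_model / n_heads features

x = np.random.rand(batch, n_seq, d_model)        # input: no head dimension

# Split: (batch, n_seq, d_model) -> (batch, n_heads, n_seq, d_head)
heads = x.reshape(batch, n_seq, n_heads, d_head).transpose(0, 2, 1, 3)

# Scaled dot-product attention runs independently per head
scores = heads @ heads.transpose(0, 1, 3, 2) / np.sqrt(d_head)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
attended = weights @ heads                       # still (batch, n_heads, n_seq, d_head)

# Merge: concatenate the heads back into d_model features
out = attended.transpose(0, 2, 1, 3).reshape(batch, n_seq, d_model)
print(x.shape, '->', out.shape)                  # (2, 10, 16) -> (2, 10, 16)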

So does that mean the input and output share the same dimensions for every encoder and decoder block?

That’s how I understand it
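And that shared shape is exactly what makes the stacking work. Here is a minimal sketch with toy stand-in blocks (not the assignment code; it just assumes Trax is installed): because each block maps (batch, n_seq, d_model) to the same shape, any number of them can be chained directly.

import numpy as np
import trax
from trax import layers as tl

d_model = 16

def toy_block():
    # Shape-preserving block: LayerNorm + Dense(d_model) inside a residual,
    # standing in for the attention / feed-forward sub-layers.
    return tl.Residual(tl.LayerNorm(), tl.Dense(d_model), tl.Relu())

# Each block maps (batch, n_seq, d_model) -> (batch, n_seq, d_model),
# so a stack of them composes without any reshaping in between.
stack = tl.Serial([toy_block() for _ in range(4)])

x = np.random.rand(2, 10, d_model).astype(np.float32)
stack.init(trax.shapes.signature(x))
y = stack(x)
print(x.shape, '->', y.shape)   # (2, 10, 16) -> (2, 10, 16)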