The encoder of the Transformer is a stack of encoder blocks. I am wondering whether the output of one encoder block actually matches the input expected by the next encoder block.
from trax import layers as tl  # attention, dropout_, feed_forward are assumed to be defined elsewhere

encoder_block = [
    # first residual branch: LayerNorm -> attention -> dropout
    tl.Residual(
        tl.LayerNorm(),
        attention,
        dropout_,
    ),
    # second residual branch: feed-forward sublayer
    tl.Residual(
        feed_forward,
    ),
]
Since the encoder block starts with an attention layer, I think its input should be something like (batch_size, n_seq, n_heads, d_model); and since the last layer of an encoder block is a feed-forward layer, I think its output should be (batch_size, n_seq, d_model). If my understanding is correct, how do the dimensions fit the input of the next encoder block?
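For reference, below is a minimal sketch I would use to check the shapes empirically. It is only an illustration, not the actual code: tl.CausalAttention stands in for the attention layer (it maps a single (batch, seq, d_model) tensor to a tensor of the same shape, so no mask input is needed), and the feed-forward block and hyperparameters are placeholders I made up.

    import numpy as np
    import trax
    from trax import layers as tl

    d_model, d_ff, n_heads = 8, 16, 2
    mode = 'eval'  # eval mode makes Dropout a no-op, so no rng is needed

    # stand-ins for the layers referenced in the snippet above (hypothetical)
    attention = tl.CausalAttention(d_model, n_heads=n_heads, mode=mode)
    dropout_ = tl.Dropout(rate=0.1, mode=mode)
    feed_forward = tl.Serial(
        tl.LayerNorm(),
        tl.Dense(d_ff),
        tl.Relu(),
        tl.Dense(d_model),
    )

    encoder_block = tl.Serial(
        tl.Residual(tl.LayerNorm(), attention, dropout_),
        tl.Residual(feed_forward),
    )

    x = np.zeros((2, 5, d_model), dtype=np.float32)  # (batch, seq, d_model)
    encoder_block.init(trax.shapes.signature(x))
    y = encoder_block(x)
    print(x.shape, y.shape)  # both print (2, 5, 8) in this sketch

In this sketch the block maps (batch, seq, d_model) to the same shape, but I would like to confirm that this is also how the real attention layer is wired, and where the n_heads axis comes in.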
Thanks