Does the number of fully connected neural networks in the transformer architecture change based on the maximum input length?

Considering the architecture of the encoder and decoder in the transformer as shown below:

  • After the self-attention mechanism, is each input token's output (z1, z2, z3, …) passed to its own separate feed-forward neural network, or are all the z's stacked together and passed through a single FFNN?
  • If all the z's are stacked into one, how is the difference in shapes between inputs of different lengths handled?
  • If every z has its own feed-forward neural network, how is that implemented in practice with arbitrary input lengths?

Hi Arjun_Reddy,

  • The z's are stacked and passed through a single position-wise feed-forward network: the same weights are applied independently to every position, so the number of FFNNs does not depend on the sequence length (see the sketch below).
  • Differences in sequence length are handled by padding (and masking) sequences to a common length within a batch; each token vector already has the fixed model dimension, so the feed-forward layer always sees the same input size per position.
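Here is a minimal sketch (using PyTorch, which the thread itself does not specify, with illustrative sizes d_model=512 and d_ff=2048) showing that one shared position-wise FFN handles any sequence length, because the linear layers act only on the last (d_model) dimension:

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    # One FFN shared across all positions: d_model -> d_ff -> d_model
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, z):
        # z: (batch, seq_len, d_model); nn.Linear operates on the last
        # dimension only, so seq_len can be arbitrary
        return self.net(z)

ffn = PositionwiseFFN()
short = torch.randn(2, 5, 512)    # sequence length 5
long = torch.randn(2, 100, 512)   # sequence length 100
print(ffn(short).shape)  # torch.Size([2, 5, 512])
print(ffn(long).shape)   # torch.Size([2, 100, 512])
```

The same module (same parameter count) processes both sequences, which is why the maximum input length never changes the number of feed-forward networks in the model.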

Hope this clarifies.
