Consider the architecture of the encoder and decoder in the Transformer, as shown below:
- After the self-attention mechanism, is each token's output (z1, z2, z3, …) passed to its own separate feed-forward neural network, or are all the z's stacked into one matrix and then passed through a single FFNN?
- If all the z's are stacked into one matrix, how is the difference in shape between inputs of different lengths handled? (A minimal sketch of what I picture here follows this list.)
- If every z has its own feed-forward neural network, how is this implemented in practice for arbitrary input lengths?
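For reference, here is a minimal sketch (my own illustration, not taken from the paper) of what I mean by the "stacked into one matrix, single FFNN" option. The class name `PositionwiseFFN` and the sizes `d_model`/`d_ff` are just placeholders I chose:

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """Sketch of one shared feed-forward network applied to the stacked z's."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, z):
        # z: (seq_len, d_model) -- all z_i stacked into one matrix.
        # nn.Linear operates on the last dimension, so the same weights are
        # applied to every row (token) independently of the sequence length.
        return self.linear2(torch.relu(self.linear1(z)))

ffn = PositionwiseFFN()
z_short = torch.randn(3, 512)   # 3 tokens
z_long  = torch.randn(7, 512)   # 7 tokens; the same module still works
print(ffn(z_short).shape, ffn(z_long).shape)
```

Is this roughly what happens, or does the actual implementation differ?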