While going through the 'Transformers: Attention Is All You Need' reading section, I came across a statement that I'm finding difficult to comprehend: 'The feed-forward network applies a point-wise fully connected layer to each position separately and identically.' Can anyone explain this statement?
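For reference, here is the kind of sketch I have in mind for that layer (a rough PyTorch-style sketch; the class and variable names are my own, not the paper's code). If I read it correctly, nn.Linear acts only on the last dimension, so the same two weight matrices are applied at every position. Is that what 'separately and identically' refers to?

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """Rough sketch of the position-wise feed-forward sublayer (names are mine)."""
    def __init__(self, d_model=512, d_ff=2048):   # sizes used in the original paper
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)       # same weights for every position
        self.fc2 = nn.Linear(d_ff, d_model)       # same weights for every position

    def forward(self, x):
        # x: (batch, seq_len, d_model); nn.Linear transforms only the last dim,
        # so each position is processed independently but with identical weights.
        return self.fc2(torch.relu(self.fc1(x)))

ffn = PositionwiseFFN()
out = ffn(torch.randn(2, 10, 512))   # the same FFN is applied at all 10 positions
```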
Could someone please explain why the feed-forward layer is placed after the multi-head attention layer and not before it? What would be the implications if we didn't include the feed-forward layer in the model?
Self-attention blocks are essentially just re-averaging the value vectors. Imagine that in BERT you have 144 self-attention blocks (12 in each of the 12 layers). If there were no FFN, they would all act much the same.
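Roughly what I mean by re-averaging, as a sketch (plain PyTorch, my own toy example): the attention weights for each position form a probability distribution over all positions, so every output is just a weighted average of the value vectors.

```python
import torch

def self_attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    # each row of `weights` is a softmax, i.e. non-negative and sums to 1
    weights = torch.softmax(Q @ K.T / K.shape[-1] ** 0.5, dim=-1)
    # so each output row is a convex combination (weighted average) of V's rows
    return weights @ V

d = 64
x = torch.randn(10, d)                        # 10 token representations
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)           # shape (10, 64)
```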
Adding the FFN makes each of them behave like a separate small model that can be trained (i.e., gets its own parameters). The whole process then becomes something like training a stacked ensemble, where each model gets different weights. It's not a perfect analogy, but the purpose of the FFN is to parameterize the self-attention modules. Each FFN has a hidden dimension of 3072 in BERT-base.
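For scale, here is roughly what that FFN looks like in BERT-base (an illustrative sketch only; the real implementation also wraps it in dropout, a residual connection, and LayerNorm):

```python
import torch
import torch.nn as nn

# BERT-base: hidden size 768 expanded to 3072 and projected back, with GELU
bert_ffn = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
)

x = torch.randn(2, 128, 768)                           # (batch, seq_len, hidden)
print(bert_ffn(x).shape)                               # torch.Size([2, 128, 768])
print(sum(p.numel() for p in bert_ffn.parameters()))   # ~4.7M parameters per layer
```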
Thank you for your response. Could you please provide a detailed explanation of what you meant by the re-averaging of values?
You mention that in BERT there are 144 self-attention blocks (12 in each layer) and that, without the FFN, they would all act similarly.
Why do you believe that all the self-attention blocks would be the same? I am of the opinion that they would differ, because the input to each multi-head attention layer is different:
input → multi-head attention 1 → multi-head attention 2.
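To make my question concrete, here is a sketch of what I have in mind (using torch.nn.MultiheadAttention with no FFN anywhere; the sizes are just made up for illustration). The second layer's input is the first layer's output, so their inputs already differ:

```python
import torch
import torch.nn as nn

d_model, n_heads = 768, 12
mha1 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
mha2 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

x = torch.randn(2, 16, d_model)   # (batch, seq_len, d_model)
h1, _ = mha1(x, x, x)             # self-attention: query = key = value = x
h2, _ = mha2(h1, h1, h1)          # operates on a different input than mha1 did
```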
There is an embedding layer that takes into account the order and position of the tokens (for BERT at least). So yes, the input to multi-head attention varies because of the position/order, but don't we need more variability (diversity) to learn other aspects of the language as well?
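To be concrete about the embedding step I mentioned, here is a simplified sketch (BERT's actual embedding layer also adds segment embeddings, LayerNorm, and dropout):

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768    # BERT-base values
tok_emb = nn.Embedding(vocab_size, hidden)       # what the token is
pos_emb = nn.Embedding(max_len, hidden)          # where the token is (learned positions)

input_ids = torch.randint(0, vocab_size, (1, 16))   # a batch of 16 token ids
positions = torch.arange(16).unsqueeze(0)           # 0, 1, 2, ..., 15
x = tok_emb(input_ids) + pos_emb(positions)         # order-aware input to attention
```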
Thanks for your reply; it made me read it again and think more about it.