While going through the 'Transformers: Attention Is All You Need' reading section, I came across a statement that I'm finding difficult to comprehend: 'The feed-forward network applies a point-wise fully connected layer to each position separately and identically.' Can anyone explain this statement?
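For reference, here is the kind of sketch I have in mind for that layer (a rough PyTorch-style sketch; the class and variable names are my own, not the paper's code). If I read it correctly, nn.Linear acts only on the last dimension, so the same two weight matrices are applied at every position. Is that what 'separately and identically' refers to?

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """Rough sketch of the position-wise feed-forward sublayer (names are mine)."""
    def __init__(self, d_model=512, d_ff=2048):   # sizes used in the original paper
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)       # same weights for every position
        self.fc2 = nn.Linear(d_ff, d_model)       # same weights for every position

    def forward(self, x):
        # x: (batch, seq_len, d_model); nn.Linear transforms only the last dim,
        # so each position is processed independently but with identical weights.
        return self.fc2(torch.relu(self.fc1(x)))

ffn = PositionwiseFFN()
out = ffn(torch.randn(2, 10, 512))   # the same FFN is applied at all 10 positions
```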
Could someone please explain why the feed-forward layer is placed after the multi-head attention layer and not before it? What would be the implications if we didn't include the feed-forward layer in the model?
Self-attention blocks are essentially just re-averaging the value vectors. Imagine that in BERT you have 144 self-attention blocks (12 in each of the 12 layers). If there were no FFN, they would all act much the same.
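Roughly what I mean by re-averaging, as a sketch (plain PyTorch, my own toy example): the attention weights for each position form a probability distribution over all positions, so every output is just a weighted average of the value vectors.

```python
import torch

def self_attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    # each row of `weights` is a softmax, i.e. non-negative and sums to 1
    weights = torch.softmax(Q @ K.T / K.shape[-1] ** 0.5, dim=-1)
    # so each output row is a convex combination (weighted average) of V's rows
    return weights @ V

d = 64
x = torch.randn(10, d)                        # 10 token representations
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)           # shape (10, 64)
```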
Adding the FFN makes each of them behave like a separate small model that can be trained (i.e., gets its own parameters). The whole process then becomes something like training a stacked ensemble, where each model gets different weights. It's not a perfect analogy, but the purpose of the FFN is to parameterize the self-attention modules. Each FFN has a hidden dimension of 3072 in BERT-base.
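For scale, here is roughly what that FFN looks like in BERT-base (an illustrative sketch only; the real implementation also wraps it in dropout, a residual connection, and LayerNorm):

```python
import torch
import torch.nn as nn

# BERT-base: hidden size 768 expanded to 3072 and projected back, with GELU
bert_ffn = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
)

x = torch.randn(2, 128, 768)                           # (batch, seq_len, hidden)
print(bert_ffn(x).shape)                               # torch.Size([2, 128, 768])
print(sum(p.numel() for p in bert_ffn.parameters()))   # ~4.7M parameters per layer
```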
Thank you for your response. Could you please provide a detailed explanation of what you meant by the re-averaging of values?
You mention that in BERT there are 144 self-attention blocks (12 in each layer) and that, without the FFN, they would all act similarly.
Why do you believe that all the self-attention blocks would be the same? I am of the opinion that they would differ, because the input to each multi-head attention layer is different:
input → multi-head attention 1 → multi-head attention 2.
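To make my question concrete, here is a sketch of what I have in mind (using torch.nn.MultiheadAttention with no FFN anywhere; the sizes are just made up for illustration). The second layer's input is the first layer's output, so their inputs already differ:

```python
import torch
import torch.nn as nn

d_model, n_heads = 768, 12
mha1 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
mha2 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

x = torch.randn(2, 16, d_model)   # (batch, seq_len, d_model)
h1, _ = mha1(x, x, x)             # self-attention: query = key = value = x
h2, _ = mha2(h1, h1, h1)          # operates on a different input than mha1 did
```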
There is an embedding layer that takes into account the order and position of the tokens (for BERT at least). So yes, the input to multi-head attention varies because of the position/order, but don't we need more variability (diversity) to learn other aspects of the language as well?
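To be concrete about the embedding step I mentioned, here is a simplified sketch (BERT's actual embedding layer also adds segment embeddings, LayerNorm, and dropout):

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768    # BERT-base values
tok_emb = nn.Embedding(vocab_size, hidden)       # what the token is
pos_emb = nn.Embedding(max_len, hidden)          # where the token is (learned positions)

input_ids = torch.randint(0, vocab_size, (1, 16))   # a batch of 16 token ids
positions = torch.arange(16).unsqueeze(0)           # 0, 1, 2, ..., 15
x = tok_emb(input_ids) + pos_emb(positions)         # order-aware input to attention
```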
Thanks for your reply; it made me read it again and think more about it.