In the Transformer architecture, the input to the encoder blocks is a set of embeddings for the sentence. My impression is that the attention mechanism creates word representations that are much richer than traditional embedding models, so I would expect the embeddings to be created inside the encoder block rather than being fed into it.
Can you please help clear up this confusion?
Hi @NIHARIKA
Attention mechanisms basically allow different tokens to be weighted based on their importance.
The first step in a Transformer is to turn the input into something the model can work with: given a sentence or any sequence of data, each word or element is converted into a numerical representation known as a vector embedding. This input embedding is created as a combination of a token embedding and a positional embedding.
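To make that combination concrete, here is a minimal PyTorch sketch. The class name `InputEmbedding` and the use of a *learned* positional table are my own illustrative choices; the original paper used fixed sinusoidal encodings instead.

```python
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    """Token embedding + positional embedding, summed elementwise.
    Learned positions are used here for brevity."""
    def __init__(self, vocab_size, max_len, d_model):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) of integer token indices
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.pos(positions)  # (batch, seq_len, d_model)
```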
The Transformer encoder layer then takes up information from self-attention, which encodes the context using three tensors: Query, Key, and Value.
In the standard (cross-)attention mechanism the query comes from the decoder, whereas the key and value pair is provided by the encoder.
But in the self-attention mechanism the query also comes from the encoder. The matrix multiplication between the query and key tensors gives the similarity score of each token (word) with all the other tokens (words). Since the values resulting from this matrix multiplication can be large, the scores are first scaled down by the square root of the key dimension; softmax then turns the score matrix into attention weights, and finally those weights are multiplied with the value matrix to generate the outputs of this layer.
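In code, that whole computation fits in a few lines. This is a generic sketch of scaled dot-product attention (the learned projection matrices and multiple heads are omitted for clarity):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (batch, seq_len, d_k); in self-attention all three
    # come from the same encoder input.
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # token-vs-token similarity, scaled down
    weights = F.softmax(scores, dim=-1)            # each row becomes a probability distribution
    return weights @ V                             # weighted sum of value vectors

x = torch.randn(1, 10, 64)                       # pretend these are projected token embeddings
out = scaled_dot_product_attention(x, x, x)      # self-attention: Q, K, V from the same place
```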
So the “being fed” part means that the Transformer first generates the embeddings from the input tokens, and then passes those embeddings through a stack of encoder layers.
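Here is a sketch of that stack using PyTorch’s built-in modules; the hyperparameters (512, 8, 6) simply follow the base model from the original paper:

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)

embeddings = torch.randn(1, 10, 512)  # (batch, seq_len, d_model), e.g. token + positional embeddings
contextual = encoder(embeddings)      # same shape, but now each vector is context-aware
```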
Then the attention mechanism allows the decoder to focus on a (weighted) subset of these vectors while decoding the translation.
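As a rough sketch of that decoder-side step (all the tensors below are stand-ins), the decoder’s hidden states act as the queries while the encoder outputs supply the keys and values:

```python
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

encoder_out = torch.randn(1, 10, 512)    # stand-in for the encoder stack's output
decoder_states = torch.randn(1, 7, 512)  # stand-in for the decoder's hidden states
context, weights = cross_attn(decoder_states, encoder_out, encoder_out)
# `weights` shows how strongly each target position attends to each source token.
```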
Hi @NIHARIKA
Let me add to Deepti’s post.
That is true - we need to initially represent the text data somehow.
That is also true - we create richer embeddings on top of the initial embeddings.
In other words, we take the initial embeddings and “make them richer” (by “looking” at the context surrounding them). The encoder block always adds “something” to the initial embeddings to arrive at the final embeddings.
Also note that attention is not “everything” there is to it. The encoder block also contains a Feed Forward layer (FFN), which “decides” what to add to the embeddings after attention has “finished” adding its part.
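A minimal sketch of one encoder block makes the “adding” explicit (names and hyperparameters here are illustrative; it is the residual connections that do the adding):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder block, post-norm style as in the original paper. Each
    sub-layer literally *adds* its output to its input (the residual
    connection) - that is the "adding something" described above."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)     # attention adds its part...
        x = self.norm2(x + self.ffn(x))  # ...then the FFN decides what else to add
        return x
```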
Cheers
P.S. You might also find this more detailed post helpful.