Transformer Architecture

In the transformer architecture, the input to the encoder blocks is the embeddings for the sentence. My understanding is that the attention mechanism creates word representations that are richer than those from traditional embedding models, so I would expect the embeddings to be created inside the encoder block rather than being fed to it.
Can you please help clear up this confusion?


Attention mechanisms basically allow different tokens to be weighted based on their importance.

The first step in a transformer is to represent the input data: given a sentence or any other sequence, each word or element is turned into a numerical representation known as a vector embedding. The input embedding is then formed by combining a token embedding with a positional embedding.
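To make that step concrete, here is a minimal sketch of combining token and positional embeddings. It assumes the sinusoidal positional encoding from the original Transformer paper and sums the two vectors; the tiny random `embedding_table`, the vocabulary size and the helper names are made up for illustration (in a real model the table is learned).

```python
import math
import random

def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position: sin on even dims, cos on odd dims."""
    encoding = []
    for i in range(d_model):
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

def embed(token_ids, embedding_table):
    """Look up each token vector and add the encoding of its position."""
    d_model = len(embedding_table[0])
    result = []
    for position, token_id in enumerate(token_ids):
        token_vector = embedding_table[token_id]
        position_vector = positional_encoding(position, d_model)
        result.append([t + p for t, p in zip(token_vector, position_vector)])
    return result

# Hypothetical toy setup: 5 tokens in the vocabulary, 4-dimensional embeddings.
random.seed(0)
embedding_table = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(5)]

token_ids = [2, 0, 2]  # the same token appears at positions 0 and 2
encoder_input = embed(token_ids, embedding_table)
```

Note that the two occurrences of token 2 end up with different input vectors because their positions differ. This is what gets fed to the first encoder layer.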

The transformer encoder layer then takes in information from self-attention, which encodes the context using three tensors: Query, Key and Value.

In the standard (encoder-decoder) attention mechanism, the query comes from the decoder, whereas the key and value pair is provided by the encoder.

But in the self-attention mechanism, the query also comes from the encoder. The matrix multiplication between the query and key tensors gives the similarity score of each token (word) with all the other tokens (words). Since the values resulting from this matrix multiplication can be large, the scores are scaled down (by the square root of the key dimension in the original Transformer), softmax is applied to the score matrix to turn it into weights, and finally the weights are multiplied with the value matrix to generate the outputs of this layer.
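The self-attention computation described here can be sketched in a few lines of plain Python. This is only an illustration under some assumptions: a single attention head, no masking, scores divided by the square root of the key dimension as in the original Transformer, and identity projection matrices standing in for the learned Query/Key/Value weights. All function and variable names are made up for the example.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    # Query, Key and Value are all projections of the same input x.
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v = matmul(x, w_v)
    d_k = len(k[0])
    # Similarity of every token with every other token.
    scores = matmul(q, transpose(k))
    # Scale the scores, then turn each row into weights that sum to 1.
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    # Each output row is a weighted mix of the value vectors.
    return matmul(weights, v), weights

# Toy input: 3 tokens with 2-dimensional embeddings (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projections
output, weights = self_attention(x, identity, identity, identity)
```

Each row of `weights` says how much one token attends to every token in the sequence, and each row of `output` is the corresponding context-aware vector.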

So what is "fed" to the transformer encoder is the set of embeddings generated from the given input tokens, and those embeddings are then passed through the stack of encoder layers.

The attention mechanism then allows the decoder to focus on the most relevant of these vectors when decoding the translation.



Let me add to Deepti’s post.

That is true - we need to initially represent the text data somehow.

That is also true - we create richer embeddings on top of the initial embeddings.
In other words, we take the initial embeddings and “make them richer” (by “looking” at the context surrounding them). The encoder block always adds “something” to the initial embeddings to arrive at the final embeddings.

Also note that attention is not “everything” there is to it. In the encoder block there is also a Feed Forward Layer (FFW), which “decides” what to add to the initial embeddings after the attention is “finished” adding its part.
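Here is a rough sketch of that "adding" idea, assuming the usual residual connections around the attention and feed-forward sub-layers. To keep it short it leaves out layer normalization, biases, multiple heads and the learned Query/Key/Value projections (Q = K = V = x), so it is a simplification rather than a full encoder block. The weights `w1`, `w2` and all names are made up for the example.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x):
    # Simplification: Q = K = V = x, i.e. no learned projections.
    d = len(x[0])
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x] for q in x]
    weights = [softmax(row) for row in scores]
    return matmul(weights, x)

def feed_forward(x, w1, w2):
    # Two linear layers with a ReLU in between, applied to each token separately.
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def add(a, b):
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def encoder_block(x, w1, w2):
    x = add(x, attention(x))             # attention adds its part to the embeddings
    x = add(x, feed_forward(x, w1, w2))  # the feed-forward layer adds its part
    return x

# Toy input: 2 tokens with 2-dimensional embeddings (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0]]
w1 = [[0.5, -0.5, 0.25], [0.1, 0.3, -0.2]]  # 2 -> 3
w2 = [[0.2, 0.0], [0.0, 0.2], [0.1, 0.1]]   # 3 -> 2
richer = encoder_block(x, w1, w2)
```

The output has the same shape as the input, so several such blocks can be stacked, each one adding a bit more context to the embeddings it receives.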


P.S. you might also find this more detailed post helpful