How is self-attention different from an embedding layer?


I can't get my head around this: an embedding layer such as word2vec conceptualizes tokens and finds relations between words, for example that "queen" is more related to "woman" than to "dog". Self-attention also conceptualizes tokens and finds relations between the tokens in the input. So how are these two layers (the embedding layer with its conceptualization step, and the self-attention layer over the input) different?


Good question.

Word embeddings find relations between all the tokens/words in the vocabulary. The size of the vocabulary varies between models, but for English it's usually in the range of roughly 30,000 to 50,000 (you can think of it sort of like an English dictionary). The relationship captured is also more of a semantic, meaning-based one: it associates words that share some commonality in their meaning.
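A tiny sketch of that idea, using made-up 4-dimensional vectors (not real word2vec output) to show how "closeness in embedding space" is measured with cosine similarity:

```python
import numpy as np

# Hypothetical toy embeddings: every word in the vocabulary maps to one
# fixed vector, and semantically related words end up closer together.
embeddings = {
    "queen": np.array([0.9, 0.8, 0.1, 0.0]),
    "woman": np.array([0.8, 0.9, 0.0, 0.1]),
    "dog":   np.array([0.1, 0.0, 0.9, 0.8]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings["queen"], embeddings["woman"]))  # high (~0.99)
print(cosine(embeddings["queen"], embeddings["dog"]))    # low  (~0.12)
```

Note that the lookup is context-free: "queen" gets the same vector no matter which sentence it appears in.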

Self-attention, on the other hand, finds relationships between words within the input sentence/context window. The supported context window is generally smaller (although some newer models can support very large ones now), and you can think of it as the maximum number of tokens supported by an LLM. The relationships self-attention tries to identify are also different from word embeddings; you can think of them as the "role" of each word in the sentence. For example, given the sentence "France is a country, it is in Europe", self-attention can model the relationship between "it" and "France". It's not so much the meaning of the word "it" as how it relates to the other words in the sentence (in this case, "it" refers to "France").
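Here's a minimal numpy sketch of scaled dot-product self-attention. It omits the learned query/key/value projection matrices a real transformer would use (the token vectors serve as all three), but it shows the core mechanic: every token's output is a weighted mix of every other token in the window:

```python
import numpy as np

def self_attention(X):
    """Minimal (untrained) scaled dot-product self-attention.

    X: (seq_len, d) matrix of token vectors. In a real transformer, X would
    first be projected into queries, keys, and values with learned matrices;
    here X plays all three roles to keep the sketch short.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)       # (seq_len, seq_len) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ X                  # each output row mixes the whole window

X = np.random.default_rng(0).normal(size=(5, 8))  # 5 tokens, 8-dim each
out = self_attention(X)
print(out.shape)  # (5, 8)
```

The softmax row for a token like "it" would, in a trained model, put high weight on "France", which is exactly the kind of within-sentence relationship described above.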


Word embeddings like Word2Vec or GloVe are trained to capture semantic relationships among words in a static manner, e.g., the vectors for 'king' and 'queen' will be closer in this vector space than the vectors for 'king' and 'apple'. The self-attention mechanism computes relationships between words in a given input sequence, adjusting them based on the context provided by that sequence.
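The static-vs-contextual distinction can be demonstrated in a few lines. This toy sketch (random vectors, an unlearned attention with queries = keys = values) shows that an embedding lookup returns the identical vector for "bank" in any sentence, while self-attention produces a different representation of "bank" in each context:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical static embedding table for a 3-word vocabulary.
vocab = {"bank": 0, "river": 1, "money": 2}
table = rng.normal(size=(3, 4))

def attend(X):
    """Untrained scaled dot-product self-attention (no learned projections)."""
    s = X @ X.T / np.sqrt(X.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

ctx1 = table[[vocab["bank"], vocab["river"]]]   # "bank river"
ctx2 = table[[vocab["bank"], vocab["money"]]]   # "bank money"

# Static lookup: identical rows for "bank" in both contexts.
print(np.allclose(ctx1[0], ctx2[0]))                  # True
# After self-attention: "bank" now mixes in its neighbors, so it differs.
print(np.allclose(attend(ctx1)[0], attend(ctx2)[0]))  # False
```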

Word embeddings are pre-trained and remain static during subsequent tasks, whereas the self-attention mechanism's weights are learned during training and can adapt to the task at hand. Self-attention can capture complex patterns and long-range dependencies in the data, which is particularly useful in tasks like translation, summarisation, and others where understanding the relationships between all parts of the input is crucial.

Both word embeddings and self-attention help capture relationships between words; they just do so in different manners: one statically, in a pre-trained fashion that captures semantic similarity, and the other dynamically, learning contextual relationships within the given input.


@lawrence and @hackyon, thanks for the explanation(s).

Would it be correct to say that the Embeddings are specific to the language, and the self-attention is specific (and learned) for the data set and task being solved?

@TMosh. Yup, that makes sense!

I think the “pre-training” and “training” terms can be a little bit confusing. Both word embeddings and self-attention can technically be pre-trained and trained/fine-tuned as necessary. I think the main takeaway here is that word embeddings are usually trained separately from self-attention.


Thanks a lot. Very nice explanation.

Great explanation, thanks a lot.