I can't get my head around this: an embedding layer such as word2vec conceptualizes tokens and finds relations between words, such as "queen" being more related to "woman" than to "dog". Self-attention also conceptualizes tokens and finds relations between the tokens in the input. So how are these two layers (the embedding layer with its conceptualization step and the self-attention layer over the input) different?
Word embeddings find relations between all the tokens/words in the vocabulary. The size of the vocabulary varies across models, but for English it's usually in the range of roughly 30,000 to 50,000 (you can think of it sort of like the English dictionary). This relationship is also more of a semantic/meaning-based one: it associates words that share some commonality in their meaning.
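To make the "closeness in meaning" idea concrete, here is a minimal sketch with made-up 4-dimensional vectors (real Word2Vec embeddings are learned and typically 100-300 dimensions; these numbers are purely illustrative):

```python
import numpy as np

# Toy embeddings; the numbers are invented for illustration, not learned.
emb = {
    "queen": np.array([0.9, 0.1, 0.8, 0.0]),
    "woman": np.array([0.2, 0.0, 0.9, 0.1]),
    "dog":   np.array([0.1, 0.2, 0.0, 0.9]),
}

def cosine(a, b):
    # Cosine similarity: the standard way to compare embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["queen"], emb["woman"]))  # high: related meanings
print(cosine(emb["queen"], emb["dog"]))    # low: unrelated meanings
```

Note that these similarities are fixed once training is done: "queen" has one vector no matter what sentence it appears in.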
Self-attention, on the other hand, finds relationships between words within the input sentence/context window. The supported context window is generally smaller (although some newer models now support very large ones), and you can think of it as the maximum number of tokens supported by an LLM. The relationship self-attention tries to identify is also different from word embeddings: you can think of it as the "role" of the words in the sentence. For example, given the sentence "France is a country, it is in Europe", self-attention can model the relationship between "it" and "France". It's not so much the meaning of the word "it" as how it relates to the other words in the sentence (in this case, "it" refers to "France").
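The pairwise, within-sequence nature of self-attention can be sketched as scaled dot-product attention. Everything below (sequence length, dimensions, random weights) is a toy setup for illustration, not a trained model:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model) token embeddings for ONE input sequence.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) pairwise scores
    weights = softmax(scores, axis=-1)  # each row: how much a token attends
                                        # to every other token in THIS input
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4         # e.g. a 5-token input sequence
X = rng.normal(size=(seq_len, d_model))
Wq = rng.normal(size=(d_model, d_k))    # learned in a real model;
Wk = rng.normal(size=(d_model, d_k))    # random here for the sketch
Wv = rng.normal(size=(d_model, d_k))

out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape)             # one contextualized vector per token
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```

The key contrast with the embedding table above: the attention weights are recomputed for every input sequence, so the same token gets a different contextualized representation depending on its neighbours.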
Word embeddings like Word2Vec or GloVe are trained to capture semantic relationships among words in a static manner: e.g., the vectors for "king" and "queen" will be closer in this vector space than the vectors for "king" and "apple". The self-attention mechanism computes relationships between words in a given input sequence, adjusting them based on the context the sequence provides.
Word embeddings are pre-trained and remain static during subsequent tasks, whereas the self-attention mechanism's weights are learned during training and can adapt to the task at hand. Self-attention can capture complex patterns and long-range dependencies in the data, which is particularly useful in tasks like translation, summarisation, and others where understanding the relationships between all parts of the input is crucial.
Both word embeddings and self-attention help in understanding relationships between words; they just do so in different ways: one statically, in a pre-trained manner capturing semantic similarity, and the other dynamically, learning contextual relationships within the given input.
Would it be correct to say that the embeddings are specific to the language, and the self-attention is specific (and learned) to the data set and task being solved?
I think the "pre-training" and "training" terms can be a little bit confusing. Both word embeddings and self-attention can technically be pre-trained and trained/fine-tuned as necessary. I think the main takeaway here is that word embeddings are usually trained separately from self-attention.