I can't get my head around this: an embedding layer such as word2vec conceptualizes tokens and finds relations between words, such as "queen" being more related to "woman" than to "dog". Self-attention also conceptualizes tokens and finds relations between the tokens in the input. So how are these two layers (the embedding layer with its conceptualization step and the self-attention layer over the input) different?
Word embeddings find relations between all the tokens/words in the vocabulary. The size of the vocabulary varies across models, but for English it's usually in the range of roughly 30,000 to 50,000 (you can think of it sort of like the English dictionary). This relationship is also more of a semantic/meaning-based one: it associates words that share some commonality in their meaning.
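To make the "closeness in meaning" idea concrete, here is a minimal sketch with made-up 4-dimensional vectors (real Word2Vec embeddings are learned and typically 100-300 dimensions; these numbers are purely illustrative):

```python
import numpy as np

# Toy embeddings; the numbers are invented for illustration, not learned.
emb = {
    "queen": np.array([0.9, 0.1, 0.8, 0.0]),
    "woman": np.array([0.2, 0.0, 0.9, 0.1]),
    "dog":   np.array([0.1, 0.2, 0.0, 0.9]),
}

def cosine(a, b):
    # Cosine similarity: the standard way to compare embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["queen"], emb["woman"]))  # high: related meanings
print(cosine(emb["queen"], emb["dog"]))    # low: unrelated meanings
```

Note that these similarities are fixed once training is done: "queen" has one vector no matter what sentence it appears in.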
Self-attention, on the other hand, finds relationships between words within the input sentence/context window. The supported context window is generally smaller (although some newer models now support very large ones), and you can think of it as the maximum number of tokens supported by an LLM. The relationship self-attention tries to identify is also different from word embeddings: you can think of it as the "role" of the words in the sentence. For example, given the sentence "France is a country, it is in Europe", self-attention can model the relationship between "it" and "France". It's not so much the meaning of the word "it" as how it relates to the other words in the sentence (in this case, "it" refers to "France").
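The pairwise, within-sequence nature of self-attention can be sketched as scaled dot-product attention. Everything below (sequence length, dimensions, random weights) is a toy setup for illustration, not a trained model:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model) token embeddings for ONE input sequence.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) pairwise scores
    weights = softmax(scores, axis=-1)  # each row: how much a token attends
                                        # to every other token in THIS input
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4         # e.g. a 5-token input sequence
X = rng.normal(size=(seq_len, d_model))
Wq = rng.normal(size=(d_model, d_k))    # learned in a real model;
Wk = rng.normal(size=(d_model, d_k))    # random here for the sketch
Wv = rng.normal(size=(d_model, d_k))

out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape)             # one contextualized vector per token
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```

The key contrast with the embedding table above: the attention weights are recomputed for every input sequence, so the same token gets a different contextualized representation depending on its neighbours.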
Word embeddings like Word2Vec or GloVe are trained to capture semantic relationships among words in a static manner: e.g., the vectors for "king" and "queen" will be closer in this vector space than the vectors for "king" and "apple". The self-attention mechanism computes relationships between words in a given input sequence, adjusting them based on the context the sequence provides.
Word embeddings are pre-trained and remain static during subsequent tasks, whereas the self-attention mechanism's weights are learned during training and can adapt to the task at hand. Self-attention can capture complex patterns and long-range dependencies in the data, which is particularly useful in tasks like translation, summarisation, and others where understanding the relationships between all parts of the input is crucial.
Both word embeddings and self-attention help in understanding relationships between words; they just do so in different ways: one statically, in a pre-trained manner capturing semantic similarity, and the other dynamically, learning contextual relationships within the given input.
Would it be correct to say that the embeddings are specific to the language, and the self-attention is specific (and learned) to the data set and task being solved?
I think the "pre-training" and "training" terms can be a little bit confusing. Both word embeddings and self-attention can technically be pre-trained and trained/fine-tuned as necessary. I think the main takeaway here is that word embeddings are usually trained separately from self-attention.