Scaled dot product attention implicit assumptions

Scaled dot product attention was introduced as follows: matmul(Q, K.T) ---> "**similarity**" of each word in Q with every word in Keys
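As a quick sketch of that operation (a minimal NumPy version of single-head attention, not the full multi-head implementation), it looks like this:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity of each query word with every key word
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = scores / scores.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                # (n_q, d_v): one output row per query word

Q = np.random.randn(2, 4)  # 2 query words
K = np.random.randn(3, 4)  # 3 key words
V = np.random.randn(3, 4)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (2, 4)
```

Note that the output has one row per query word, which is the point raised in Ques 2 below.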

In the case of machine translation from English to German, Q was said to be the German word already decoded and Key was each English word in the input sentence.

Ques 1: We have English and German embeddings, both in the same d_model dimensions, but it doesn't mean that the English embedding for any word will point in the same general direction as the corresponding German word (which is when the dot product gives maximum similarity). For d_model = 3, say, the word "I" in English can be [0.5, 0.1, 0.8] while "ich" can be [-0.5, 0.8, -0.9]. Clearly the dot product between the two won't indicate similarity.
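For concreteness, with those (made-up) vectors the dot product is actually negative:

```python
import numpy as np

i_en   = np.array([0.5, 0.1, 0.8])    # made-up embedding for English "I"
ich_de = np.array([-0.5, 0.8, -0.9])  # made-up embedding for German "ich"
print(np.dot(i_en, ich_de))  # ~ -0.89, i.e. the vectors point in roughly opposite directions
```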

Ques 2: The number of outputs of attention is equal to the number of query words. In the case of cross attention, will it be the number of input words + the number of decoded words?

Hi @Mayank11 ,

Regarding Q1: Why do you think that "I" in English could have a certain embedding and "ich" in German could have a "far away" location in this multi-dimensional space? I'd like to understand your intuition on this so I can follow up with my comments based on that.

Q2: The number of outputs of attention, after the concatenation of the different outputs of the heads, is going to be equal to the context size. For example, if the model's context size is 512, then the output of the attention module will be 512. This is a fixed number. A separate thing is the number of words in the input. For example, you can have a model with a context size of 512 tokens while the sentence has only 100 tokens. The output of the attention modules will still be 512, where the first portion contains information about the input sentence and the rest is padding.


@Juan_Olano Thanks for responding.

On Q1: My intuition is that an embedding algorithm essentially clusters similar-meaning words of a language together in a higher-dimensional vector space. Since English and German are different languages and may have unique characteristics, "I" in English could be very far from "ich" in German even if they mean the same thing. In an extreme case, what if the German embedding vectors are just a 90-degree-rotated version of the English vectors?
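The 90-degree-rotation worry can be made concrete with a toy d_model = 2 example (entirely made-up numbers): a rotated vector is orthogonal to its original, so the dot product is exactly zero:

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.standard_normal((5, 2))     # toy "English" embeddings, d_model = 2
R = np.array([[0.0, -1.0],
              [1.0,  0.0]])         # 90-degree rotation matrix
G = E @ R.T                         # hypothetical "German" embeddings: rotated copies
print(np.einsum('ij,ij->i', E, G))  # row-wise dot products: all zero
```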

For cross-attention, the inputs are the encoder output as keys and values and the decoder self-attention output as queries. Similarity is then calculated between the final encoder output and the decoder's self-attention output (which changes each time a new word is decoded). Can we expect any similarity between the final encoded vector of "I" and the first self-attention output of the "ich" vector in the decoder, which itself is constantly changing as the decoder generates more and more words?
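A shape-level sketch of this cross-attention setup (random toy tensors, no trained weights) also bears on Ques 2 above: the output has one row per query (decoded) token, not queries + keys:

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention on 2-D arrays."""
    w = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(w - w.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)  # softmax over keys
    return w @ V

d_model = 8
enc_out = np.random.randn(6, d_model)  # encoder output for 6 English tokens -> K and V
dec_q   = np.random.randn(2, d_model)  # decoder self-attention output for 2 decoded tokens -> Q
cross = attention(dec_q, enc_out, enc_out)
print(cross.shape)  # (2, 8): one row per decoded token, not 6 + 2
```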

I can get around this by thinking that maybe the original embeddings for both languages will themselves be learnt during this process, which would ensure that German and English word embeddings are similar even after multiple attention layers. But then this approach does not appeal to me. I don't know whether, in this approach, I would be able to just extract the German embeddings and use them in a different task. Something like CBOW, which independently calculates embeddings for your vocabulary and lets you use them in your own tasks, gives a lot more flexibility, and I would use something like that in the model. That way, training time and model size might reduce considerably as well. But then the vector for "ich" can be very different from the vector for "I".

Q2: This makes sense. The size can be the context size all the time.

Thank you for your reply.

Let's remember that the transformer is trained "in parallel" on the 2 languages, meaning that we pass the English sentence and its corresponding German translation. During this training, the transformer receives both the English and the German phrases, and it is in this process that it learns to align the words of both languages by paying attention to the context of the words in the phrases. To do this efficiently, the transformer pays attention from different 'perspectives' thanks to the 'multi-headed' attention modules. Each head learns one 'perspective' or 'aspect' of the sentences, and learns the correct relationship between the English words in the sentence and the German words in the translated sentence. So even if "I" and "Ich" are not necessarily in proximity in the embedding matrix, the model uses attention, context, and yes, to a certain point the embedding matrix, to find the correct translation.