I still can’t understand how the core of attention works, even though it is an important topic, and even after reading the Attention Is All You Need paper. And by the way, I have trouble understanding not only this dot-product attention, but the basic one too.
For example, we translate from English to French. So, as I understand it, we have the matrices:
- Q with all the embeddings of the French vocabulary
- K with the embeddings of the words of the English sentence
- V, the same as K
During training, Q can be around the sentence length. So we try to learn an alignment (similarity) between combinations of the source-language K embeddings and each target-language Q embedding, with the help of weights. That part is more or less understandable.
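To make sure I describe the same picture, here are the shapes I have in mind (purely hypothetical sizes and random numbers, one query row per target position during training as I wrote above; only the shapes matter):

import numpy as np

d_model = 3              # toy embedding size, just for illustration
fr_len, en_len = 2, 2    # lengths of the French (target) and English (source) sentences

Q = np.random.rand(fr_len, d_model)   # one row per target (French) embedding
K = np.random.rand(en_len, d_model)   # one row per source (English) word
V = np.random.rand(en_len, d_model)   # same source words as K

scores = Q @ K.T                      # (fr_len, en_len): one score per (French, English) pair
print(scores.shape)                   # (2, 2)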
But what happens when we multiply Q by K? How does that help capture similarity?
Here is an example from your lab (I don’t understand why k and v are so different, by the way):
q = create_tensor([[1, 0, 0], [0, 1, 0]])   # queries
display_tensor(q, 'query')
k = create_tensor([[1, 2, 3], [4, 5, 6]])   # keys
display_tensor(k, 'key')
v = create_tensor([[0, 1, 0], [1, 0, 1]])   # values
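To make sure I am reading it right, here is how I reproduce the full scaled dot-product step on these tensors with plain numpy (I am only guessing that create_tensor is essentially np.array, since I just copied the lines above):

import numpy as np

q = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)   # queries (my "French" embeddings)
k = np.array([[1, 2, 3], [4, 5, 6]], dtype=float)   # keys    (the "English" embeddings)
v = np.array([[0, 1, 0], [1, 0, 1]], dtype=float)   # values  (same words as k)

d_k = q.shape[-1]
scores = q @ k.T / np.sqrt(d_k)                     # scaled dot products, shape (2, 2)
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys
output = weights @ v                                # weighted sum of the value rows

print(scores)    # [[0.577 2.309], [1.155 2.887]]
print(weights)   # each row sums to 1
print(output)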
So [1, 0, 0] is the embedding of the first French word, and [1, 2, 3] is the embedding of the first English word in the sentence.
So after the dot product we get 1, and after the second multiplication of this 1 with v we get back the same vector [0, 1, 0].
But if we make the first word embedding, for example, [1, 1, 1], the first dot product becomes 6 and the second result [0, 6, 0]. That value is higher than the first one, so it should get more probability after the softmax, shouldn’t it?
So in this case all words with bigger embeddings should end up with higher softmax probability, which is not what we expect. Can you explain how these Scaled Dot-Product operations help us with alignment? How does it work?
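To put my concern into numbers, here is the comparison from my example done explicitly (same assumption that these are plain numpy arrays; the softmax is taken over the two English keys for a single query):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

k = np.array([[1, 2, 3], [4, 5, 6]], dtype=float)
d_k = k.shape[-1]

q_small = np.array([1, 0, 0], dtype=float)   # the original first query
q_big   = np.array([1, 1, 1], dtype=float)   # the "bigger" embedding from my example

print(softmax(q_small @ k.T / np.sqrt(d_k)))   # ~[0.15, 0.85]
print(softmax(q_big @ k.T / np.sqrt(d_k)))     # ~[0.006, 0.994]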