Understanding Scaled Dot-Product Attention with math

I still can’t understand how the core of attention works, even though it is an important topic and even after reading the Attention Is All You Need paper. And by the way, I have problems understanding not only this dot-product attention, but the basic one too.

For example, suppose we translate from English to French. As I understand it, we have the matrices:

  • Q with all embeddings of the French vocabulary
  • K with embeddings of the English sentence words
  • V the same as K

Q can be around the sentence length during training. So, we try to learn an alignment (similarity) between combinations of the source language K embeddings for each Q embedding of the target language, with the help of weights. That is more or less understandable.

But what happens when we multiply Q by K? How does that help to measure similarity?
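
For reference, this is the formula from the paper that I am asking about:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$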

This is an example from your lab (I don’t understand why k and v are so different, by the way):

q = create_tensor([[1, 0, 0], [0, 1, 0]])
display_tensor(q, 'query')
k = create_tensor([[1, 2, 3], [4, 5, 6]])
display_tensor(k, 'key')
v = create_tensor([[0, 1, 0], [1, 0, 1]])

So, [1, 0, 0] is the embedding of the first French word,
and [1, 2, 3] is the embedding of the first English word in the sentence.

So, after the dot product we get 1, and after the second multiplication of this 1 with v we get the same vector [0, 1, 0] back.
But if we make the first word embedding, for example, [1, 1, 1], the first dot product will be 6 and the second result [0, 6, 0]. This value is higher than the first one, so it should get more probability after the softmax, shouldn’t it?
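
To make my hand calculation concrete, here is a minimal NumPy sketch of the full scaled dot-product attention on the same tensors (this is my own reconstruction with a small softmax helper, not the lab’s create_tensor / display_tensor code):

import numpy as np

def softmax(x, axis=-1):
    # subtract the row max before exponentiating, for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

q = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)  # queries: 2 target positions
k = np.array([[1, 2, 3], [4, 5, 6]], dtype=float)  # keys:    2 source positions
v = np.array([[0, 1, 0], [1, 0, 1]], dtype=float)  # values:  same positions as the keys

d_k = q.shape[-1]                    # embedding size, 3 in this example
scores = q @ k.T / np.sqrt(d_k)      # all query-key dot products, scaled by sqrt(d_k)
weights = softmax(scores, axis=-1)   # softmax across the keys in each row
output = weights @ v                 # each output row is a weighted average of the value rows

print(scores)   # approx [[0.58 2.31]
                #         [1.15 2.89]]
print(weights)  # approx [[0.15 0.85]
                #         [0.15 0.85]]
print(output)   # approx [[0.85 0.15 0.85]
                #         [0.85 0.15 0.85]]

The raw, unscaled scores for the first query are [1, 4], which is where my hand-computed 1 comes from.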

So, in this case all words with larger embeddings should end up with higher softmax probability, which is not something we expect. Can you explain how these Scaled Dot-Product operations help us with alignment? How does it work?
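
And here is the same sketch continued (still my own code, not the lab’s) with the first query replaced by [1, 1, 1], which is the case I describe above:

q_big = np.array([[1, 1, 1], [0, 1, 0]], dtype=float)  # first query changed to [1, 1, 1]
scores_big = q_big @ k.T / np.sqrt(d_k)                # raw scores for row 0 are [6, 15]
print(softmax(scores_big, axis=-1))
# approx [[0.006 0.994]
#         [0.15  0.85 ]]
# the softmax is still normalised across the two keys within each row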

MLOps doesn’t deal with attention mechanisms.
Please move your topic to the correct subcategory.
Here’s the community user guide to get started.

done

Thanks. Adding @arvyzukai & @Elemento