The Q, K, V structure is not a technology original to the Transformer.
A simple "attention" mechanism uses a "Source-Target" type reference model. It only has a direct reference between "target" and "source", like the left-hand side chart below. (The diagram is from FRUSTRATINGLY SHORT ATTENTION SPANS IN NEURAL LANGUAGE MODELING.)
But this direct relationship may not be flexible enough when "these two are quite similar and should have a relation, i.e., attention, but the referred word needs another word to be meaningful". So "source" was separated into "key" and "value", and "target" was renamed "query", just like the right-hand side chart. Later, these key-value pairs were called a "dictionary". Details are in Key-Value Memory Networks for Directly Reading Documents. (In earlier work, K-V was also called "memory".)
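As a rough intuition (my own minimal sketch, not from the papers above): a Python dict does a hard lookup, returning exactly one value for an exactly matching key, while attention does a soft lookup, returning a mix of all values weighted by how similar the query is to each key. The names `query`, `keys`, `values` below are just illustrative.

```python
import numpy as np

def soft_lookup(query, keys, values):
    """Soft 'dictionary' lookup: blend all values by query-key similarity."""
    scores = keys @ query                            # similarity of the query to every key
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax -> attention weights
    return weights @ values                          # weighted sum of the values

# toy data: 3 key-value pairs, 4-dim keys, 2-dim values (all made up)
keys   = np.random.randn(3, 4)
values = np.random.randn(3, 2)
query  = np.random.randn(4)

print(soft_lookup(query, keys, values))  # a blend of the 3 values, not a single exact match
```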
So, a straightforward implementation of this Q, K, V system is "Source/Target attention", which is the 2nd MHA in our assignment. In the case of English-to-French translation, "Source" is the Encoder side, which holds the "English" dictionary, and "Target" is the Decoder, which holds the French reference sentence (the label).
Here:
Q : Target (output from 1st MHA in Decoder, French sentence)
K/V : Source (output from Encoder, English sentence)
(Note that a "sentence" here is not a list of words, of course. It is "word embedding" + "position encoding".)
Then, attention weights are created to translate the English sentence into the French sentence.
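A minimal shape sketch of this Source/Target attention, assuming a single head and no batch dimension; the sizes (`len_en`, `len_fr`, `d_model`) are made up for illustration:

```python
import numpy as np

d_model = 8                # "embedding_dim" in our case (toy value)
len_en, len_fr = 6, 5      # lengths of the English (source) and French (target) sentences

enc_out    = np.random.randn(len_en, d_model)   # Encoder output -> K, V (English side)
dec_hidden = np.random.randn(len_fr, d_model)   # output of the 1st MHA in the Decoder -> Q (French side)

# projection matrices are trainable in the real model; random here
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

Q = dec_hidden @ W_q       # (len_fr, d_model)
K = enc_out   @ W_k        # (len_en, d_model)
V = enc_out   @ W_v        # (len_en, d_model)

scores  = Q @ K.T / np.sqrt(d_model)             # (len_fr, len_en): each French position attends over English positions
scores  = scores - scores.max(-1, keepdims=True)                  # numerical stability
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # softmax over the English positions
out     = weights @ V                            # (len_fr, d_model): carried back to the target side
print(out.shape)                                 # (5, 8)
```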
Now, let's go to "Self-attention". (I think starting from "Self-attention" may not be appropriate without first having an overview of the Transformer.)
The first step in the MHA for self-attention is to find the similarity between Q (query) and K (key).
It is simply done by a "dot product". Remember that the "dot product" is basically a "cosine similarity", scaled by the vector norms, from its definition:
a \cdot b = \|a\| \, \|b\| \cos\theta
If two vectors are similar, then \cos\theta becomes close to 1. I suppose you remember "word embedding"; that is what I am referring to. This is only one aspect of this MHA, but it is good for building intuition.
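A quick numeric check of that identity (plain NumPy, nothing from the assignment; the vectors are made up):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.5])   # nearly parallel to a

cos_theta = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(a @ b)                                              # the dot product ...
print(np.linalg.norm(a) * np.linalg.norm(b) * cos_theta)  # ... equals ||a|| ||b|| cos(theta)
print(cos_theta)                                          # close to 1 for similar directions
```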
With this, we can create a similarity map, which is the base for the attention weights. The key point is that Q (and K, V as well) includes both the "word embedding" and the "position encoding". If it were only the "word embedding", the cosine similarity would be about the word itself, without any position information, and that is not what "attention" expects. By adding the "position encoding", we can define a similarity from both the "word vector" and the "word position" viewpoints.
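A small sketch of that point, assuming sinusoidal position encoding as in the original Transformer (the `pos_encoding` helper below is my own minimal version, not the assignment's):

```python
import numpy as np

def pos_encoding(seq_len, d_model):
    """Sinusoidal position encoding (sin on even dims, cos on odd dims)."""
    pos = np.arange(seq_len)[:, None]
    i   = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

seq_len, d_model = 4, 8
emb = np.random.randn(seq_len, d_model)   # toy word embeddings
emb[3] = emb[0]                           # the same word appears at positions 0 and 3

scores_words_only = emb @ emb.T           # identical words -> identical similarities
x = emb + pos_encoding(seq_len, d_model)  # word embedding + position encoding
scores_with_pos   = x @ x.T               # the two copies now differ because their positions differ

print(scores_words_only[0, 3], scores_words_only[0, 0])  # equal: position is invisible
print(scores_with_pos[0, 3], scores_with_pos[0, 0])      # no longer equal
```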
Then, we apply masks and create attention weights with Softmax (and a scale factor based on d_{model}, which is equal to "embedding_dim" in our case). Finally, we get the final output as a dot product of the attention weights and V. An important point is that the mapping between K and V is also trainable.
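Putting these steps together, here is a minimal single-head sketch of scaled dot-product attention (scaling, masking, Softmax, then the dot product with V). It assumes a causal look-ahead mask like the Decoder's 1st MHA uses; the function and variable names are mine, not the assignment's API.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]                          # here d_k == d_model (single head)
    scores = Q @ K.T / np.sqrt(d_k)            # similarity map, scaled
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked positions get ~zero weight after Softmax
    scores = scores - scores.max(-1, keepdims=True)                   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # Softmax -> attention weights
    return weights @ V, weights                # output = attention weights . V

seq_len, d_model = 5, 8
x = np.random.randn(seq_len, d_model)          # word embedding + position encoding (toy)

# trainable projections in the real model (this is where the K -> V mapping is learned); random here
W_q, W_k, W_v = [np.random.randn(d_model, d_model) for _ in range(3)]
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # each position sees itself and the past only

out, attn = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v, causal_mask)
print(out.shape, attn.shape)   # (5, 8) (5, 5)
```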
Hope this helps some.