Hi Sir,
I can't understand why, when computing self-attention, Q, K and V all have to be the same. What does that mean? Can anyone please help to explain?
Hi, @Anbu!
Check this post. I think we have already answered that question before. If you need some clarification after reading it, just ask and I’ll be here to help you.
Sir, sorry, I'm not able to understand it from the post. But I'm confident about the architecture. Can you please explain what it means that query, key and value are the same? Please, I'm stuck, kindly explain.
Also, if query, key and value are the same, then why is key_dim = embedding dimension in the assignment code?
```python
self.mha = MultiHeadAttention(num_heads=num_heads,
                              key_dim=embedding_dim,
                              dropout=dropout_rate)
```
It means those three tensors have the same dimensions. If you encode your embeddings as, say, vectors of 1024 values, then q, k and v will each be vectors of 1024 values.
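Here is a minimal sketch of what that looks like in practice (my own toy example, assuming the TensorFlow/Keras MultiHeadAttention layer from the assignment; the batch size, sentence length and number of heads are just illustrative):

```python
import tensorflow as tf

embedding_dim = 1024  # matches the 1024-value embeddings mentioned above
mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=embedding_dim)

# x: a batch of 2 sentences, 5 tokens each, embedded as 1024-value vectors
x = tf.random.uniform((2, 5, embedding_dim))

# Self-attention: the same tensor is passed as query, value and key
out = mha(query=x, value=x, key=x)
print(out.shape)  # (2, 5, 1024) -- same shape as the input
```

The only thing that makes it "self"-attention is that the same x is used for all three arguments.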
Okay, sir.
Then, if q = k = v = X, why isn't Q = W·X happening? Where does it actually happen?
Also, why do we need to apply a mask in the encoder layer?
I was also struggling with the question of why "q = k = v = X" for self-attention. For me the key to understanding it is: here "q", "k" and "v" refer to the arguments passed when calling the MultiHeadAttention layer, i.e. they are the embeddings of the words used to calculate the corresponding Q, K and V inside the layer, not the final values of Q, K and V themselves.
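To make that concrete, here is a small sketch (my own illustration with made-up sizes, not the assignment code) showing where the projections actually live: the same x goes into the layer three times, and the layer's own learned weight matrices turn it into different Q, K and V.

```python
import tensorflow as tf

embedding_dim, num_heads = 1024, 2  # illustrative sizes
mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embedding_dim)

x = tf.random.uniform((2, 5, embedding_dim))
_ = mha(query=x, value=x, key=x)  # call the layer once so it builds its weights

# The layer owns separate projection weights for query, key and value.
# Internally it computes Q = x @ W_q, K = x @ W_k and V = x @ W_v before the
# scaled dot-product attention, so Q, K and V end up different even though
# the same x is fed in three times.
for w in mha.weights:
    print(w.name, w.shape)
```

If you run it, you should see weights with names along the lines of query/kernel, key/kernel and value/kernel; those are the W matrices used in the Q = W·X step, so the projection does happen, just inside the layer rather than in the code you write for the assignment.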