key_dim in Multi-Head Attention

Hi Mentor,

Can you please help me understand where the arguments below actually play a role in the transformer architecture?

key_dim → Size of each attention head for queries and keys.

value_dim → Size of each attention head for values.
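To make those two definitions concrete, here is a minimal NumPy sketch (an illustration, not the actual Keras implementation) of a single attention head's projections. All sizes here are hypothetical examples. Queries and keys are both projected down to `key_dim` so that their dot product is defined, while values are projected to `value_dim`, which is the size of each head's output before the heads are concatenated and mixed back up.

```python
import numpy as np

# Hypothetical sizes for illustration: the model dimension and the
# per-head sizes controlled by key_dim and value_dim.
d_model, key_dim, value_dim = 16, 4, 6
seq_len = 5
rng = np.random.default_rng(0)

x = rng.standard_normal((seq_len, d_model))   # input to one attention head

# Learned projection matrices (random here, trained in a real model).
W_q = rng.standard_normal((d_model, key_dim))
W_k = rng.standard_normal((d_model, key_dim))
W_v = rng.standard_normal((d_model, value_dim))

# Queries and keys share key_dim; values get value_dim.
Q, K, V = x @ W_q, x @ W_k, x @ W_v
print(Q.shape, K.shape, V.shape)  # (5, 4) (5, 4) (5, 6)
```

So `key_dim` sets the width of the query/key space where compatibility scores are computed, and `value_dim` sets the width of what each head actually outputs.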


Hi @Anbu

Let me take a look and I'll get back to you.

Meanwhile, I recommend you take a look at the original Transformer paper, Section 3.2, where keys and values are discussed to explain how an 'attention' function works.


Hi again @Anbu

Pondering your question, my best answer as to how keys and values fit into a transformer, though still somewhat abstract, is the one given in the original Transformer paper:

“An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.”
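The quoted description can be sketched directly in NumPy. This is the scaled dot-product attention from the same paper: the "compatibility function" is a scaled dot product between each query and each key, softmaxed into weights, which then form a weighted sum of the values. The sizes below are hypothetical examples.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Compatibility of each query with each key (scaled dot product),
    # turned into weights, then used for a weighted sum of the values.
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

rng = np.random.default_rng(1)
Q = rng.standard_normal((5, 4))   # 5 queries of size key_dim = 4
K = rng.standard_normal((5, 4))   # 5 keys, same size as the queries
V = rng.standard_normal((5, 6))   # 5 values of size value_dim = 6
out, w = attention(Q, K, V)
print(out.shape)  # (5, 6): one weighted sum of values per query
```

Note that the output inherits `value_dim` from V, while `key_dim` only affects the weights; that is exactly the split the two arguments control.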

Have you completed the transformers programming assignment? It may help you fit things together in context.

Otherwise, could you be more specific/detailed with your question? At the moment it seems too broad to answer. It may help to know where the question arose: what exercise were you doing? / what minute of which video lecture were you watching? / etc.

Sir, I will check and get back to you. Can you please also help on the thread below?
