Self-Attention formula

In the Self-Attention video, it is said that in the Transformer architecture the self-attention representation of the t-th word, whose embedding is denoted x^{<t>}, is given by

A(q,K,V) = \sum_i \frac{\exp(q \cdot k^{<i>})}{\sum_j \exp(q \cdot k^{<j>})}v^{<i>}

Is the element q in the formula a vector such that q = q^{<t>}? And is A(q,K,V) a vector as well?

As far as I understand, q^{<t>}, k^{<t>} and v^{<t>} are vectors, each a linear transformation of the embedding x^{<t>}. The query q = q^{<t>} is then dotted with each of the keys k^{<j>}, so that q \cdot k^{<j>} is a scalar. After that, the values v^{<i>} associated with each embedding are summed, each weighted by its softmax coefficient, so the resulting A(q,K,V) should be a vector of the same dimension as v^{<t>}. Is that right?
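
To check that reasoning concretely, here is a minimal NumPy sketch of the formula above. The dimensions and projection matrices are made up for illustration only; they are not taken from the course.

```python
# Quick shape check of A(q, K, V) for a single query q = q^{<t>}.
# All dimensions below are illustrative assumptions.
import numpy as np

T, d_x, d_k, d_v = 5, 8, 4, 6            # sequence length and assumed dimensions
rng = np.random.default_rng(0)

X = rng.normal(size=(T, d_x))            # embeddings x^{<1>}, ..., x^{<T>} as rows
W_Q = rng.normal(size=(d_x, d_k))        # learned projections (random here)
W_K = rng.normal(size=(d_x, d_k))
W_V = rng.normal(size=(d_x, d_v))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # q^{<i>}, k^{<i>}, v^{<i>} stacked as rows

t = 2
q = Q[t]                                  # q = q^{<t>}, a vector of shape (d_k,)
scores = K @ q                            # q . k^{<j>} for every j: shape (T,), each entry a scalar
weights = np.exp(scores) / np.exp(scores).sum()   # softmax over the T keys
A = weights @ V                           # weighted sum of values: shape (d_v,)

print(q.shape, scores.shape, A.shape)     # (4,) (5,) (6,) -> A has the same dimension as v^{<t>}
```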

Could you also solve def scaled_dot_product_attention in the assignment and answer the same question about shapes there?
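
For what it's worth, here is my rough NumPy sketch of what I think that function computes, written only to reason about shapes. It is not the assignment's TensorFlow template; the function signature and the mask convention are my own assumptions.

```python
# Rough sketch of scaled dot-product attention (shapes only).
# Signature and mask convention are assumptions, not the assignment's code.
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (T_q, d_k), K: (T_k, d_k), V: (T_k, d_v) -> output: (T_q, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (T_q, T_k): every q^{<t>} dotted with every k^{<j>}
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # assumed convention: False means "do not attend"
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax over the keys
    return weights @ V                          # (T_q, d_v): one attention vector per query

# Shape check with made-up dimensions:
T, d_k, d_v = 5, 4, 6
rng = np.random.default_rng(0)
out = scaled_dot_product_attention(rng.normal(size=(T, d_k)),
                                   rng.normal(size=(T, d_k)),
                                   rng.normal(size=(T, d_v)))
print(out.shape)   # (5, 6): row t is A(q^{<t>}, K, V), same dimension as v^{<t>}
```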