One of the quiz questions asks about the concept of Self-Attention. I couldn’t find an unambiguously correct option among the 4 provided answers, so I chose the closest one: “its neighbouring words are used to compute its context by taking the average of those word values…”. But apparently this is not regarded as correct.

In the lecture, the attention A is expressed as

A = \sum_i \frac{\exp(q \cdot k^{\langle i \rangle})}{\sum_j \exp(q \cdot k^{\langle j \rangle})} v^{\langle i \rangle}.

The fraction is the softmax probability for word i. If we write p^{\langle i \rangle} = \frac{\exp(q \cdot k^{\langle i \rangle})}{\sum_j \exp(q \cdot k^{\langle j \rangle})}, then A becomes \sum_i p^{\langle i \rangle} v^{\langle i \rangle}. Since \sum_i p^{\langle i \rangle} = 1, isn’t this a (weighted) average?
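For concreteness, here is a quick numerical sketch (with made-up toy values for q, the keys, and the values, not anything from the course assignment) checking that the softmax weights sum to 1, so A is indeed a convex combination of the value vectors:

```python
import numpy as np

# Toy sizes: 4 words, key/value dimension 3 (arbitrary made-up numbers)
rng = np.random.default_rng(0)
q = rng.standard_normal(3)        # query for the current word
K = rng.standard_normal((4, 3))   # keys k^<i>, one row per word
V = rng.standard_normal((4, 3))   # values v^<i>, one row per word

scores = K @ q                               # q . k^<i> for each i
p = np.exp(scores) / np.exp(scores).sum()    # softmax probabilities p^<i>

A = p @ V                         # A = sum_i p^<i> v^<i>

print(p.sum())                    # sums to 1 (up to floating point)
```

Since the p^{\langle i \rangle} are nonnegative and sum to 1, A is a weighted average of the v^{\langle i \rangle}, which is exactly the point of my question.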
Thanks!