For the programming assignment “Transformer Networks”, exercise 3 “Self Attention”, can you provide more details on the function "scaled_dot_product_attention"?
In particular, what is the structure of the formula:
- softmax(Q K^T / sqrt(d_k) + M) V
- Is the softmax taken across rows or across columns, i.e. over which dimension?
- Are the queries the rows or the columns of Q?
- What is V: are the vectors v_i its rows or its columns?
- How is the softmax output multiplied by V (matrix product or Kronecker product)? What are the dimensions of each term?
- What should the output look like, and what are its dimensions?
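For concreteness, the questions above can be pinned down against one common reading of the formula. The NumPy sketch below is only that: it assumes queries, keys and values are stored as rows, that the softmax runs over the last axis (the key axis), that M is an additive mask (0 to keep, a large negative number to block), and that the final step is an ordinary matrix product. The name `attention_sketch` and the toy inputs are hypothetical, and the assignment's own function may use different conventions, for example for the mask.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the per-row maximum for numerical stability before exponentiating.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention_sketch(Q, K, V, M=None):
    """Hypothetical 2-D sketch of softmax(Q K^T / sqrt(d_k) + M) V.

    Assumed shapes (rows are vectors):
      Q: (seq_q, d_k), K: (seq_k, d_k), V: (seq_k, d_v), M: (seq_q, seq_k)
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_q, seq_k): one row per query
    if M is not None:
        scores = scores + M              # additive mask (assumption)
    weights = softmax(scores, axis=-1)   # each row sums to 1 over the keys
    output = weights @ V                 # matrix product: (seq_q, d_v)
    return output, weights

# Toy inputs: 3 queries, 4 keys/values, d_k = 2, d_v = 3.
Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
V = np.arange(12, dtype=float).reshape(4, 3)

# Block the last key for every query.
M = np.zeros((3, 4))
M[:, 3] = -1e9

output, weights = attention_sketch(Q, K, V, M)
print(weights.shape)  # (3, 4)
print(output.shape)   # (3, 3)
```

Under these assumptions each output row is a weighted average of the rows of V, so the output has one row per query and d_v columns.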
It would help enormously to have a unit test below the function with all of the details and vectors, as in previous weeks. Those tests made it much easier to decipher these important details, which are not specified in this exercise.
To that end, can you provide the actual code of the unit tests for exercise 3 and the exercises below it, instead of just the single function calls that are there?