I don’t understand why we create the look-ahead mask as an upper triangular matrix filled with ones rather than a lower triangular one. According to the documentation for MultiHeadAttention, attention_mask is a boolean mask that prevents attention to certain positions; it specifies which query elements can attend to which key elements, where 1 indicates attention and 0 indicates no attention. Since we pretend that the model has predicted only the first part of the sentence, I intuitively expect attention to be paid only to the beginning of the sentence, while the end is unknown. Shouldn’t each row of the mask therefore look like [1, 1, …, 0, 0, 0]? Instead, the mask we create looks like this:

[[0., 1., 1.],
[0., 0., 1.],
[0., 0., 0.]]

With this mask, it looks as if we first pretend that the end of the sentence has already been predicted and only the first word is unknown (first row), then that the first two words are unknown (second row), and, finally, that the whole sentence is unknown (third row). That seems backwards to me.

Does anyone know how it actually works?

Hey @pugach, a nice question!

It seems that they initialize the attention mask with ones just for convenience. Later, the complement of the mask, (1 - mask), is multiplied by -1e9 and added to the attention scores:

\begin{aligned} \left( 1 - \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \\ \end{bmatrix} \right) \times (-10^{9}) &= \begin{bmatrix} 0 & -10^{9} & -10^{9} \\ 0 & 0 & -10^{9} \\ 0 & 0 & 0 \\ \end{bmatrix} \\ QK^{T} + \begin{bmatrix} 0 & -10^{9} & -10^{9} \\ 0 & 0 & -10^{9} \\ 0 & 0 & 0 \\ \end{bmatrix} &= \begin{bmatrix} \cdot & \text{large negative} & \text{large negative} \\ \cdot & \cdot & \text{large negative} \\ \cdot & \cdot & \cdot \\ \end{bmatrix} \end{aligned}

After the softmax, the positions that received the large negative bias end up with attention weights close to zero.
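You can check this arithmetic with a small NumPy sketch (an illustration of the masking trick, not the assignment code; the variable names are my own):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Keras convention: 1 = attend, 0 = do not attend
mask = np.tril(np.ones((3, 3)))

# Invert the mask and scale, so forbidden positions get a huge negative bias
bias = (1.0 - mask) * -1e9

rng = np.random.default_rng(0)
scores = rng.standard_normal((3, 3))   # stand-in for QK^T / sqrt(d_k)
weights = softmax(scores + bias)

# Entries above the diagonal are ~0 after the softmax
print(np.round(weights, 3))
```

Each row still sums to 1, but all of the probability mass sits at or before the current position.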

The attention mask calculation is updated to match the current assignment version.

Hello. From what I can see, the mask is multiplied by -1e9 in scaled_dot_product_attention(), which seems to be used only for demonstration. In fact, the Encoder/Decoder layers use MultiHeadAttention from Keras. I don’t know how Keras implements it, but since the docs say “1 indicates attention”, I assume that the elements pushed toward -inf are the zero elements of the mask rather than the ones.
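For concreteness, here is a minimal NumPy sketch of a scaled_dot_product_attention that follows the Keras convention (1 = attend). This is my own illustration of the idea, not the assignment's or Keras's actual implementation:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, mask=None):
    """Sketch only. mask follows the Keras docs' convention:
    1 = attend, 0 = do not attend (zeros are pushed toward -inf)."""
    d_k = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d_k)
    if mask is not None:
        scores = scores + (1.0 - mask) * -1e9  # zero entries -> large negative
    # Numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

With a lower triangular mask of ones, the returned weights are zero above the diagonal, i.e. each query attends only to itself and earlier positions.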

It seems that you are correct. The attention mask is multiplied by -1e9 and gets added to the attention logits in the TF code here. I’ll inform the team about the error in the assignment.


Hi @manifest,

I think the create_look_ahead_mask function, coded as band_part(ones, num_lower=-1, num_upper=0), should return a lower triangular matrix filled with ones, not an upper triangular one as the comment describes.
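That matches the documented semantics of tf.linalg.band_part: num_lower=-1 keeps the entire lower band and num_upper=0 keeps nothing above the diagonal, so on a matrix of ones the result equals np.tril. A NumPy re-expression of those semantics (an illustration, not the TF implementation):

```python
import numpy as np

def band_part(x, num_lower, num_upper):
    """Re-expresses tf.linalg.band_part semantics in NumPy:
    keep element (i, j) iff (num_lower < 0 or i - j <= num_lower)
    and (num_upper < 0 or j - i <= num_upper)."""
    n, m = x.shape
    i, j = np.indices((n, m))
    keep_lower = (num_lower < 0) | (i - j <= num_lower)
    keep_upper = (num_upper < 0) | (j - i <= num_upper)
    return np.where(keep_lower & keep_upper, x, 0)

ones = np.ones((3, 3))
print(band_part(ones, -1, 0))  # lower triangular matrix of ones
```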