Course 5 Week 4 Transformer Look-ahead Mask

I don’t understand why we create the look-ahead mask as an upper triangular matrix filled with ones rather than a lower triangular one. According to the documentation for MultiHeadAttention, this is a boolean mask of shape (B, T, S) that prevents attention to certain positions; it specifies which query elements can attend to which key elements, where 1 indicates attention and 0 indicates no attention. Since we pretend that the model has predicted only the first part of the sentence, I intuitively expect attention to be paid only to the beginning of the sentence, while the end of the sentence is still unknown. Thus, shouldn’t a row of the attention mask look like [1, 1, …, 0, 0, 0]?

In the notebook we have a look-ahead mask printed as
[[0., 1., 1.],
[0., 0., 1.],
[0., 0., 0.]]

It looks as if, in the first row, we pretend that the sentence ending has already been predicted and only the first word is unknown; in the second row, the first two words are unknown; and in the third row, the whole sentence is unknown. That seems a bit weird to me.

Does anyone know how it actually works?
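
For reference, this is roughly how the notebook builds that mask (a minimal sketch with the size hard-coded to 3):

import tensorflow as tf

size = 3
# The notebook takes 1 minus a lower-triangular matrix of ones.
mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
print(mask.numpy())
# [[0. 1. 1.]
#  [0. 0. 1.]
#  [0. 0. 0.]]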

Hey @pugach, a nice question!

Initially, it seems that they fill the attention mask with ones just for convenience. Later, (1 - mask) is multiplied by -1e9 and added to the attention logits (the QK^T scores) before the softmax:

\begin{aligned} \left(1 - \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix}\right) \times (-10^{9}) &= \begin{bmatrix} 0 & -10^{9} & -10^{9} \\ 0 & 0 & -10^{9} \\ 0 & 0 & 0 \end{bmatrix} \\ QK^{T} + \begin{bmatrix} 0 & -10^{9} & -10^{9} \\ 0 & 0 & -10^{9} \\ 0 & 0 & 0 \end{bmatrix} &= \begin{bmatrix} . & small & small \\ . & . & small \\ . & . & . \end{bmatrix} \end{aligned}
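
In code, that step looks roughly like this (a sketch of the masking in the assignment’s scaled_dot_product_attention, with a made-up 3x3 example where zeros stand in for the actual QK^T scores):

import tensorflow as tf

mask = tf.linalg.band_part(tf.ones((3, 3)), -1, 0)   # lower triangle of ones
scores = tf.zeros((3, 3))                            # stand-in for the QK^T scores

# Positions where mask == 0 get a -1e9 penalty; everything else is untouched.
masked_scores = scores + (1. - mask) * -1e9
print(masked_scores.numpy())
# Entries above the diagonal become -1e9; all other entries are unchanged.
# After the softmax, these very negative logits turn into (near) zero attention weights.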

The attention mask calculation above has been updated to match the current version of the assignment.

Hello. From what I can see, the mask is multiplied by -1e9 in scaled_dot_product_attention(), which seems to be used only for demonstration. In fact, the Encoder/Decoder layers use MultiHeadAttention from Keras. I don’t know how Keras implements it, but since the docs say “1 indicates attention”, I assume that whatever gets set to -Inf is the zero elements of the mask rather than the ones.
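
A quick way to check the Keras convention (a toy sketch, not the assignment code; the shapes and sizes here are made up):

import numpy as np
import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=1, key_dim=4)
x = tf.random.normal((1, 3, 4))                       # (batch, seq_len, features)

# Per the docs, attention_mask has shape (B, T, S) and True/1 means "attend".
causal_mask = tf.cast(tf.linalg.band_part(tf.ones((1, 3, 3)), -1, 0), tf.bool)

_, weights = mha(x, x, attention_mask=causal_mask, return_attention_scores=True)
print(np.round(weights.numpy(), 3))                   # shape (1, num_heads, 3, 3)
# The weights above the diagonal come out as ~0: query i attends only to keys j <= i.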

It seems that you are correct. The attention mask is multiplied by -1e9 and added to the attention weights in the TF code here. I’ll inform the team about the error in the assignment.


Hi @manifest,

I think the create_look_ahead_mask function, coded as band_part(ones, num_lower=-1, num_upper=0), should return a lower triangular matrix filled with ones, not an upper triangular one as the comment describes (a quick check is below).

Could you please check it?
Thank you.
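
A quick check (function name as in the assignment; the exact signature is assumed here):

import tensorflow as tf

def create_look_ahead_mask(size):
    # Keep all sub-diagonals (num_lower=-1) and no super-diagonals (num_upper=0).
    return tf.linalg.band_part(tf.ones((size, size)), num_lower=-1, num_upper=0)

print(create_look_ahead_mask(3).numpy())
# [[1. 0. 0.]
#  [1. 1. 0.]
#  [1. 1. 1.]]   <- lower triangular, matching "1 indicates attention"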

True. I didn’t update the post after the assignment got fixed, but have done it now. Thank you.

Thank you for updating the assignment. Unfortunately, I’m afraid it’s still not quite correct. The look-ahead mask itself now looks fine: it’s a lower-triangular matrix of ones. However, what we do in scaled_dot_product_attention() does not seem to match what Keras does with the attention mask. As I mentioned above, the Keras docs say “1 indicates attention”, so from our mask
\begin{bmatrix} 1 & 0 & 0\\ 1 & 1 & 0\\ 1 & 1 & 1 \end{bmatrix}
we should obtain
\begin{bmatrix} . & small & small\\ . & . & small\\ . & . & . \end{bmatrix}
The previous version of the assignment simulated the correct result using an incorrect mask (recall that create_look_ahead_mask returned mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)). The current version uses a correct mask, but the result of the transformation is wrong.
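
To make the expected behaviour concrete, here is a small sketch (zeros stand in for the real scores; the second variant is only a hypothetical illustration of penalizing the wrong half of the mask, not necessarily what the notebook actually does):

import tensorflow as tf

mask = tf.constant([[1., 0., 0.],
                    [1., 1., 0.],
                    [1., 1., 1.]])
scores = tf.zeros((3, 3))                        # stand-in for the scaled QK^T

expected = scores + (1. - mask) * -1e9           # penalize where mask == 0 ("no attention")
flipped = scores + mask * -1e9                   # hypothetical: penalizing where mask == 1
print(tf.nn.softmax(expected, axis=-1).numpy())  # row i puts ~0 weight on positions j > i
print(tf.nn.softmax(flipped, axis=-1).numpy())   # wrong: e.g. the first query now attends only to future positions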

That seems correct to me. Thanks.