Course 5 Week 4 Transformer Look-ahead Mask

I don’t understand why we create the look-ahead mask as an upper triangular matrix filled with ones rather than a lower triangular one. According to the documentation for MultiHeadAttention, this is a boolean mask of shape (B, T, S) that prevents attention to certain positions; it specifies which query elements can attend to which key elements, where 1 indicates attention and 0 indicates no attention. Since we pretend that the model has predicted only the first part of the sentence, I intuitively expect attention to be paid only to the beginning of the sentence, while the end of the sentence is still unknown. Thus, shouldn’t a row of the attention mask look like [1, 1, …, 0, 0, 0]?

In the notebook we have a look-ahead mask printed as
[[0., 1., 1.],
[0., 0., 1.],
[0., 0., 0.]]

It looks as if, in the first row, we pretend that the sentence ending has already been predicted and only the first word is unknown; in the second row, the first two words are unknown; and in the third row, the whole sentence is unknown. That seems a bit weird to me.

Does anyone know how it actually works?
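
For reference, this is roughly how the notebook builds that mask (a minimal sketch with the size hard-coded to 3):

import tensorflow as tf

size = 3
# The notebook takes 1 minus a lower-triangular matrix of ones.
mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
print(mask.numpy())
# [[0. 1. 1.]
#  [0. 0. 1.]
#  [0. 0. 0.]]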

Hey @pugach, a nice question!

Initially, it seems that they fill the attention mask with ones just for convenience. Later, (1 - mask) is multiplied by -1e9 and added to the attention logits (the QK^T scores) before the softmax:

\begin{aligned} \left(1 - \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix}\right) \times (-10^{9}) &= \begin{bmatrix} 0 & -10^{9} & -10^{9} \\ 0 & 0 & -10^{9} \\ 0 & 0 & 0 \end{bmatrix} \\ QK^{T} + \begin{bmatrix} 0 & -10^{9} & -10^{9} \\ 0 & 0 & -10^{9} \\ 0 & 0 & 0 \end{bmatrix} &= \begin{bmatrix} . & small & small \\ . & . & small \\ . & . & . \end{bmatrix} \end{aligned}
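
In code, that step looks roughly like this (a sketch of the masking in the assignment’s scaled_dot_product_attention, with a made-up 3x3 example where zeros stand in for the actual QK^T scores):

import tensorflow as tf

mask = tf.linalg.band_part(tf.ones((3, 3)), -1, 0)   # lower triangle of ones
scores = tf.zeros((3, 3))                            # stand-in for the QK^T scores

# Positions where mask == 0 get a -1e9 penalty; everything else is untouched.
masked_scores = scores + (1. - mask) * -1e9
print(masked_scores.numpy())
# Entries above the diagonal become -1e9; all other entries are unchanged.
# After the softmax, these very negative logits turn into (near) zero attention weights.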

The attention mask calculation above has been updated to match the current version of the assignment.

Hello. From what I can see, the mask is multiplied by -1e9 in scaled_dot_product_attention(), which seems to be used only for demonstration. In fact, the Encoder/Decoder layers use MultiHeadAttention from Keras. I don’t know how Keras implements it, but since the docs say “1 indicates attention”, I assume that whatever gets set to -Inf is the zero elements of the mask rather than the ones.
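
A quick way to check the Keras convention (a toy sketch, not the assignment code; the shapes and sizes here are made up):

import numpy as np
import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=1, key_dim=4)
x = tf.random.normal((1, 3, 4))                       # (batch, seq_len, features)

# Per the docs, attention_mask has shape (B, T, S) and True/1 means "attend".
causal_mask = tf.cast(tf.linalg.band_part(tf.ones((1, 3, 3)), -1, 0), tf.bool)

_, weights = mha(x, x, attention_mask=causal_mask, return_attention_scores=True)
print(np.round(weights.numpy(), 3))                   # shape (1, num_heads, 3, 3)
# The weights above the diagonal come out as ~0: query i attends only to keys j <= i.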

It seems that you are correct. The attention mask is multiplied by -1e9 and added to the attention weights in the TF code here. I’ll inform the team about the error in the assignment.


Hi @manifest,

I think the create_look_ahead_mask function, coded as band_part(ones, num_lower=-1, num_upper=0), should return a lower triangular matrix filled with ones, not an upper triangular one as the comment describes (a quick check is below).

Could you please check it?
Thank you.
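
A quick check (function name as in the assignment; the exact signature is assumed here):

import tensorflow as tf

def create_look_ahead_mask(size):
    # Keep all sub-diagonals (num_lower=-1) and no super-diagonals (num_upper=0).
    return tf.linalg.band_part(tf.ones((size, size)), num_lower=-1, num_upper=0)

print(create_look_ahead_mask(3).numpy())
# [[1. 0. 0.]
#  [1. 1. 0.]
#  [1. 1. 1.]]   <- lower triangular, matching "1 indicates attention"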

True. I didn’t update the post after the assignment got fixed, but have done it now. Thank you.

Thank you for updating the assignment. Unfortunately, I’m afraid it’s still not quite correct. The look-ahead mask itself now looks fine: it’s a lower-triangular matrix of ones. However, what we do in scaled_dot_product_attention() does not seem to match what Keras does with the attention mask. As I mentioned above, the Keras docs say “1 indicates attention”, so from our mask
\begin{bmatrix} 1 & 0 & 0\\ 1 & 1 & 0\\ 1 & 1 & 1 \end{bmatrix}
we should obtain
\begin{bmatrix} . & small & small\\ . & . & small\\ . & . & . \end{bmatrix}
The previous version of the assignment simulated the correct result using an incorrect mask (recall that create_look_ahead_mask returned mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)). The current version uses a correct mask, but the result of the transformation is wrong.
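
To make the expected behaviour concrete, here is a small sketch (zeros stand in for the real scores; the second variant is only a hypothetical illustration of penalizing the wrong half of the mask, not necessarily what the notebook actually does):

import tensorflow as tf

mask = tf.constant([[1., 0., 0.],
                    [1., 1., 0.],
                    [1., 1., 1.]])
scores = tf.zeros((3, 3))                        # stand-in for the scaled QK^T

expected = scores + (1. - mask) * -1e9           # penalize where mask == 0 ("no attention")
flipped = scores + mask * -1e9                   # hypothetical: penalizing where mask == 1
print(tf.nn.softmax(expected, axis=-1).numpy())  # row i puts ~0 weight on positions j > i
print(tf.nn.softmax(flipped, axis=-1).numpy())   # wrong: e.g. the first query now attends only to future positions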

That seems correct to me. Thanks.