C4W2 Question about Decoder self-attention layer masks

@arvyzukai @Deepti_Prasad

For the first attention layer (the self-attention layer), shouldn’t the masking be as follows:

i. use_causal_mask = True in both training and inference modes?
ii. because the causal mask is used, the padding mask (or look_ahead_mask) is irrelevant in this layer?

Thank you

Hi @Cawnpore_Charlie

I’m not sure I understand what you mean.

The “look ahead mask” is the same thing as the “causal mask”.
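
For reference, a minimal sketch of the lower-triangular structure both names refer to (assuming TensorFlow; the assignment’s own helper may use the opposite convention, with 1 marking the blocked positions instead):

```python
import tensorflow as tf

def look_ahead_mask(size):
    # Lower-triangular matrix of ones: position i may attend to positions 0..i.
    return tf.linalg.band_part(tf.ones((size, size)), -1, 0)

print(look_ahead_mask(4).numpy())
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]
```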

Also, I’m not sure what you mean by the “first” attention layer. Do you mean the encoder? If so, we do not use the causal (or look-ahead) mask in the encoder. (Maybe you’re confusing the term “self-attention” with “causal attention”?)

Anyways, please clarify.
Thank you.

Sorry for not being clear.

By “first attention layer” I am referring to the self-attention layer in the Decoder.

I’m referring to the ‘use_causal_mask’ parameter of the MultiHeadAttention layer.

It is a Boolean parameter, and my understanding is that if one calls the layer with use_causal_mask set to True, then one does not need to provide the look_ahead_mask, since TensorFlow will compute the causal mask automatically for each training example.

If my understanding is correct, then the assignment code can be simplified by using this parameter.
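
For illustration, a minimal sketch of the equivalence (hypothetical shapes and layer sizes, not the assignment code; use_causal_mask is only available in reasonably recent TensorFlow releases, around 2.10+):

```python
import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=8)
x = tf.random.normal((1, 6, 16))  # (batch, target_len, d_model)
seq_len = 6

# Explicit look-ahead / causal mask (1 = attention allowed), shape (1, T, T):
causal = tf.linalg.band_part(tf.ones((1, seq_len, seq_len)), -1, 0)
out_explicit = mha(query=x, value=x, key=x, attention_mask=causal)

# Same computation, letting Keras build the causal mask internally:
out_builtin = mha(query=x, value=x, key=x, use_causal_mask=True)

print(tf.reduce_max(tf.abs(out_explicit - out_builtin)).numpy())  # ~0.0
```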

The other thing I wanted to confirm is that the causal mask (lower-triangular) is used both during training and during inference - correct?

Thank you.

I believe your understanding is correct, but I would argue that learning is better when things are explicit. The whole assignment code could be simplified even more (maybe to ~20 lines), but I believe understanding of the internal mechanisms would suffer as a result.

In our implementation - yes. In theory, we could avoid using the causal mask during inference.
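
The intuition: during greedy decoding only the last position’s output is used to pick the next token, and the last position is allowed to attend to every token generated so far anyway, so masking makes no difference for that position. A quick sketch (hypothetical shapes, not the assignment code):

```python
import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=8)
x = tf.random.normal((1, 5, 16))  # tokens generated so far

out_causal = mha(query=x, value=x, key=x, use_causal_mask=True)
out_plain = mha(query=x, value=x, key=x)

# Earlier positions differ, but the last position (the only one used to
# predict the next token) matches up to floating-point noise:
print(tf.reduce_max(tf.abs(out_causal[:, -1] - out_plain[:, -1])).numpy())
```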

Cheers

Thank you.
