C5_W4_A1_Transformer_Subclass_v1: Why is look_ahead mask used before the padding_mask

Just a theory question. The decoder has two multi-head attention layers. Why is the look_ahead mask used before the padding_mask?

Please explain why the markdown notes in sections 2.1 and 2.2 don’t answer your question.

Ok this is my understanding from 2.1-2.2:

Here is why, as I understand it, applying padding before look_ahead would not be useful:
Applying the look_ahead mask to an already-padded input means we would have already lost some information by the time look ahead is applied.

This is because padding either truncates sentences or appends 0’s so that all sentences end up the same length. So with look ahead, we could be predicting a 0 (because the sentence was padded with 0’s) or predicting a part of the sentence we never wanted to (because it was truncated).

Let me know if I got something wrong. Thanks!

The look ahead mask is meant to treat tokens in the future as unknown when performing attention.
Consider the following text translation task from English to French:

  • Input: The complete English text.
  • Output: French translation.

Since we have access to the entire input sentence at the encoder, we can perform self-attention across all input tokens. This is why there is no look ahead mask at the encoder.

The decoder, on the other hand, is supposed to generate the French text with only the [START TOKEN] marker signifying the previous input. When training this model, we have pairs of English and French sentences as input / output pairs.
Keep in mind that while we have the full output available during training, at each step in the output we can perform attention only across the tokens we have encountered / generated so far. As a result, we want to remove references to all tokens at future timesteps when we only have data up to the current timestep. This is the purpose of the look ahead mask.
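For intuition, here is a minimal sketch of a look-ahead mask in TensorFlow. I'm using my own convention here (1 = allowed to attend, 0 = masked); the notebook's helper may encode it the other way around, so treat this as an illustration rather than the assignment's exact code:

```python
import tensorflow as tf

def look_ahead_mask(size):
    # Lower-triangular matrix of ones: position i may attend to positions 0..i only.
    # Convention in this sketch: 1 = allowed to attend, 0 = masked (future token).
    return tf.linalg.band_part(tf.ones((size, size)), -1, 0)

print(look_ahead_mask(4))
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]
```

Row i of this matrix is the mask used when generating token i: everything to the right of the diagonal (the future) is blocked.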

The padding mask is there to ensure that the softmax probabilities for padded positions are very low (effectively zero), so that no attention is paid to padding tokens and we pick a valid token as the translation at each timestep.
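As a rough sketch of that mechanism (same hedged convention as above, and assuming token id 0 is used for padding), masked positions get a large negative value added to their attention logits before the softmax, so their weights come out near zero:

```python
import tensorflow as tf

def padding_mask(seq):
    # seq: (batch, seq_len) of token ids, assuming id 0 is the padding token.
    # Returns 1.0 for real tokens, 0.0 for padding positions.
    return tf.cast(tf.math.not_equal(seq, 0), tf.float32)

def masked_softmax(logits, mask):
    # Add a very large negative number to masked positions so that,
    # after the softmax, their attention weights are effectively zero.
    return tf.nn.softmax(logits + (1.0 - mask) * -1e9, axis=-1)

seq = tf.constant([[7, 13, 4, 0, 0]])             # last two positions are padding
scores = tf.random.normal((1, 5))                 # attention logits for one query
print(masked_softmax(scores, padding_mask(seq)))  # weights on the two pads are ~0
```

Note that the two masks are not alternatives: in many Transformer implementations the decoder's first attention block combines the look-ahead mask with the target's padding mask (with this sketch's convention, via an element-wise minimum, so a position must be allowed by both), while the second block attends over the encoder output and only needs the encoder's padding mask.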

Please go through the markdown texts again with this information in mind.

Thank you!

What markdown notes in sections 2.1 and 2.2 are you referring to? I am seeing only the PDFs (DLS Course 5: Lecture Notes). Could you please share a link to these markdown notes in sections 2.1 and 2.2?

I’m referring to the markdown text in the notebook (C5_W4_A1_Transformer_Subclass_v1.ipynb). See this link.