C5_W4_A1_Transformer_Subclass_v1: Why is look_ahead mask used before the padding_mask

Just a theory question. The decoder has two multi-head attention layers. Why is the look_ahead mask used before the padding_mask?

Please explain why the markdown notes in sections 2.1 and 2.2 don’t answer your question.

Ok this is my understanding from 2.1-2.2:

Here is why, as I understand it, applying padding before look_ahead would not be useful:
Applying the look_ahead mask to an already-padded input means we would have already lost some information by the time look ahead is applied.

This is because padding either truncates sentences or appends 0’s so that all sentences end up the same length. So with look ahead, we could be predicting a 0 (because the sentence was padded with 0’s) or predicting a part of the sentence we never wanted to (because it was truncated).

Let me know if I got something wrong. Thanks!

The look ahead mask is meant to treat tokens in the future as unknown when performing attention.
Consider the following text translation task from English to French:

  • Input: The complete English text.
  • Output: French translation.

Since we have access to the entire input sentence at the encoder, we can perform self-attention across all input tokens. This is why there is no look ahead mask at the encoder.

The decoder, on the other hand, is supposed to generate the French text with only the [START TOKEN] marker signifying the previous input. When training this model, we have pairs of English and French sentences as input / output pairs.
Keep in mind that while we have the full output available during training, at each step in the output we can perform attention only across the tokens we have encountered / generated so far. As a result, we want to remove references to all tokens at future timesteps when we only have data up to the current timestep. This is the purpose of the look ahead mask.
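For intuition, here is a minimal sketch of a look-ahead mask in TensorFlow. I'm using my own convention here (1 = allowed to attend, 0 = masked); the notebook's helper may encode it the other way around, so treat this as an illustration rather than the assignment's exact code:

```python
import tensorflow as tf

def look_ahead_mask(size):
    # Lower-triangular matrix of ones: position i may attend to positions 0..i only.
    # Convention in this sketch: 1 = allowed to attend, 0 = masked (future token).
    return tf.linalg.band_part(tf.ones((size, size)), -1, 0)

print(look_ahead_mask(4))
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]
```

Row i of this matrix is the mask used when generating token i: everything to the right of the diagonal (the future) is blocked.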

The padding mask is there to ensure that the softmax probabilities for padded positions are very low (effectively zero), so that no attention is paid to padding tokens and we pick a valid token as the translation at each timestep.
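As a rough sketch of that mechanism (same hedged convention as above, and assuming token id 0 is used for padding), masked positions get a large negative value added to their attention logits before the softmax, so their weights come out near zero:

```python
import tensorflow as tf

def padding_mask(seq):
    # seq: (batch, seq_len) of token ids, assuming id 0 is the padding token.
    # Returns 1.0 for real tokens, 0.0 for padding positions.
    return tf.cast(tf.math.not_equal(seq, 0), tf.float32)

def masked_softmax(logits, mask):
    # Add a very large negative number to masked positions so that,
    # after the softmax, their attention weights are effectively zero.
    return tf.nn.softmax(logits + (1.0 - mask) * -1e9, axis=-1)

seq = tf.constant([[7, 13, 4, 0, 0]])             # last two positions are padding
scores = tf.random.normal((1, 5))                 # attention logits for one query
print(masked_softmax(scores, padding_mask(seq)))  # weights on the two pads are ~0
```

Note that the two masks are not alternatives: in many Transformer implementations the decoder's first attention block combines the look-ahead mask with the target's padding mask (with this sketch's convention, via an element-wise minimum, so a position must be allowed by both), while the second block attends over the encoder output and only needs the encoder's padding mask.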

Please go through the markdown texts again with this information in mind.

Thank you!

What markdown notes in sections 2.1 and 2.2 are you referring to? I am seeing only the PDFs (DLS Course 5: Lecture Notes). Could you please share a link to these markdown notes in sections 2.1 and 2.2?

I’m referring to the markdown text in the notebook (C5_W4_A1_Transformer_Subclass_v1.ipynb). See this link.