C4W2 Question about Decoder self-attention layer masks

@arvyzukai @Deepti_Prasad

For the first attention layer (the self-attention layer), shouldn’t the masking be as follows:

i. use_causal_mask = True in both training and inference modes?
ii. because the causal mask is used, the padding mask (or look_ahead_mask) is irrelevant in this layer?

Thank you

Hi @Cawnpore_Charlie

I’m not sure I understand what you mean.

The “look ahead mask” is the same thing as the “causal mask”.
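
For reference, a minimal sketch of the lower-triangular structure both names refer to (assuming TensorFlow; the assignment’s own helper may use the opposite convention, with 1 marking the blocked positions instead):

```python
import tensorflow as tf

def look_ahead_mask(size):
    # Lower-triangular matrix of ones: position i may attend to positions 0..i.
    return tf.linalg.band_part(tf.ones((size, size)), -1, 0)

print(look_ahead_mask(4).numpy())
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]
```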

Also, I’m not sure what you mean by the “first” attention layer. Do you mean the encoder? If so, we do not use the causal (or look-ahead) mask in the encoder. (Maybe you’re confusing the term “self-attention” with “causal attention”?)

Anyways, please clarify.
Thank you.

Sorry for not being clear.

By “first attention layer” I am referring to the self-attention layer in the Decoder.

I’m referring to the ‘use_causal_mask’ parameter of the MultiHeadAttention layer.

It is a Boolean parameter, and my understanding is that if one calls the layer with use_causal_mask set to True, then one does not need to provide the look_ahead_mask, since TensorFlow will compute the causal mask automatically for each training example.

If my understanding is correct, then the assignment code can be simplified by using this parameter.
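
For illustration, a minimal sketch of the equivalence (hypothetical shapes and layer sizes, not the assignment code; use_causal_mask is only available in reasonably recent TensorFlow releases, around 2.10+):

```python
import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=8)
x = tf.random.normal((1, 6, 16))  # (batch, target_len, d_model)
seq_len = 6

# Explicit look-ahead / causal mask (1 = attention allowed), shape (1, T, T):
causal = tf.linalg.band_part(tf.ones((1, seq_len, seq_len)), -1, 0)
out_explicit = mha(query=x, value=x, key=x, attention_mask=causal)

# Same computation, letting Keras build the causal mask internally:
out_builtin = mha(query=x, value=x, key=x, use_causal_mask=True)

print(tf.reduce_max(tf.abs(out_explicit - out_builtin)).numpy())  # ~0.0
```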

The other thing I wanted to confirm is that the causal mask (lower-triangular) is used both during training and during inference - correct?

Thank you.

I believe your understanding is correct, but I would argue that learning is better when things are explicit. The whole assignment code could be simplified even more (maybe to ~20 lines), but I believe understanding of the internal mechanisms would suffer as a result.

In our implementation - yes. In theory, we could avoid using the causal mask during inference.
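
The intuition: during greedy decoding only the last position’s output is used to pick the next token, and the last position is allowed to attend to every token generated so far anyway, so masking makes no difference for that position. A quick sketch (hypothetical shapes, not the assignment code):

```python
import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=8)
x = tf.random.normal((1, 5, 16))  # tokens generated so far

out_causal = mha(query=x, value=x, key=x, use_causal_mask=True)
out_plain = mha(query=x, value=x, key=x)

# Earlier positions differ, but the last position (the only one used to
# predict the next token) matches up to floating-point noise:
print(tf.reduce_max(tf.abs(out_causal[:, -1] - out_plain[:, -1])).numpy())
```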

Cheers

Thank you.
