Transformer Encoder (BERT) Mask Input

Since the input to a BERT-style model (a transformer encoder) looks like the following:

input example: Thank you [MASK] me to your party [MASK] week
output example: inviting, this

Should we add a train_mask to this training data, with mask = 1 at the masked positions and 0 elsewhere, so that when computing the loss we only count the loss on the predicted masked words?
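For concreteness, a per-position loss mask of the kind described above might be implemented like the minimal PyTorch sketch below (all tensor names, shapes, and mask positions are hypothetical, not taken from the assignment):

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes and tensors, standing in for real model outputs.
batch, seq_len, vocab_size = 2, 8, 30522
logits = torch.randn(batch, seq_len, vocab_size)         # model predictions
labels = torch.randint(0, vocab_size, (batch, seq_len))  # ground-truth token ids
loss_mask = torch.zeros(batch, seq_len)                  # 1 at [MASK] positions, 0 elsewhere
loss_mask[:, 2] = 1.0  # e.g. position of the first [MASK]
loss_mask[:, 6] = 1.0  # e.g. position of the second [MASK]

# Per-token cross-entropy without reduction, then zero out the unmasked
# positions and average over the masked ones only.
per_token_loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),
    labels.reshape(-1),
    reduction="none",
).reshape(batch, seq_len)
loss = (per_token_loss * loss_mask).sum() / loss_mask.sum()
```

An equivalent convention, used by Hugging Face's BERT implementations, is to set the label at every non-masked position to -100, which F.cross_entropy skips by default via ignore_index=-100.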

I’m not sure I understand your question. Could you elaborate? Which model (assignment or …) are you talking about? Which mask?