Since the input for the BERT model (a transformer encoder) looks like the following during masked language modeling:
input example: Thank you [MASK] me to your party [MASK] week
output example: inviting, this
Should we add a train_mask to this training data, with mask = 1 only at the masked positions and 0 elsewhere, so that when calculating the loss we only count the loss of the predicted masked words?
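To make the question concrete, here is a minimal PyTorch sketch of what I mean. All sizes, token ids, and positions are made up for illustration; in real training the logits would come from the model's MLM head:

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes: one sequence of 8 tokens, BERT-base-like vocab.
batch_size, seq_len, vocab_size = 1, 8, 30522

# Random values stand in for the MLM head's logits here.
logits = torch.randn(batch_size, seq_len, vocab_size)

# targets hold the original token id at every position (random ids
# here for illustration).
targets = torch.randint(0, vocab_size, (batch_size, seq_len))

# train_mask is 1 at the masked positions (2 and 6, i.e. "inviting"
# and "this" in the example above) and 0 everywhere else.
train_mask = torch.zeros(batch_size, seq_len)
train_mask[0, 2] = 1.0
train_mask[0, 6] = 1.0

# Per-token loss, then zero out the unmasked positions and average
# over the masked ones only.
per_token = F.cross_entropy(
    logits.view(-1, vocab_size), targets.view(-1), reduction="none"
).view(batch_size, seq_len)
loss = (per_token * train_mask).sum() / train_mask.sum()
```

An equivalent alternative would be to fold the mask into the labels by setting unmasked positions to -100, which `F.cross_entropy` ignores by default via its `ignore_index` argument.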