Specifically for the ‘tokenize_and_mask’ function: if more than one word is masked, only one mask symbol is generated. When this is fed into the model, how can the model know that more than one word was masked?
What the model needs to do is predict which token or tokens belong in that place.
To be more concrete: in your example, the labels would be the tokens that together make up “delicious BBQ”. The loss function checks the model’s output probabilities for those tokens (if they are high, the average loss is small; if they are low, the average loss is big) and returns the mean of the loss over those tokens.
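As a rough illustration of that averaging, here is a minimal sketch in plain Python. The token spellings and the probability values are made up for the example; the real model would produce a full distribution over the vocabulary, and the real loss is computed on logits, but the idea of averaging per-token cross-entropy is the same:

```python
import math

# Hypothetical: the masked span "delicious BBQ" was replaced by one mask
# symbol, and the tokenizer split it into three subword label tokens.
label_tokens = ["del", "icious", "BBQ"]

# Made-up probabilities the model assigned to each correct label token.
confident = {"del": 0.9, "icious": 0.8, "BBQ": 0.95}
uncertain = {"del": 0.1, "icious": 0.2, "BBQ": 0.05}

def mean_cross_entropy(probs, labels):
    # Cross-entropy for the correct token is -log(p); average over all labels.
    return sum(-math.log(probs[t]) for t in labels) / len(labels)

# High probabilities for the labels -> small average loss, and vice versa.
print(mean_cross_entropy(confident, label_tokens) <
      mean_cross_entropy(uncertain, label_tokens))
```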
Note that a token (or tokens) in this case is not necessarily a “word” or “words”. In the Assignment, tokens are subwords (for instance, the label in your example is the subword “a!”). One word (for example ‘going’) could be made up of a couple of tokens/subwords (like ‘go’ and ‘ing’), or a single token could cover a whole word or even a sequence of words.
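To see how one word can split into several subword tokens, here is a toy greedy longest-match tokenizer. This is not the assignment’s actual tokenizer (real ones like SentencePiece or WordPiece are trained on data and use their own vocabularies); the vocabulary here is invented just to show the ‘going’ → ‘go’ + ‘ing’ idea:

```python
def subword_tokenize(word, vocab):
    # Greedily match the longest vocabulary piece at each position.
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                pieces.append(piece)
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to a single character
            i += 1
    return pieces

# Made-up vocabulary for illustration only.
vocab = {"go", "ing", "delicious", "BBQ"}
print(subword_tokenize("going", vocab))     # one word, two subword tokens
print(subword_tokenize("delicious", vocab)) # one word, one token
```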
The model only cares about outputting high probabilities for the label tokens in that place.