How are training samples created?

Yes @Peixi_Zhu, you can say that the input is two lists of tokens. Just to mention one additional thing - training is usually done with mini-batches, meaning there are, for example, 32 pairs of lists for each update of the model weights.

Well, because of the needed padding, the 27 tokens would become 32. The output would then have shape 32 x 33000. But your understanding is correct.

The mask in this case would be 0 for the 5 padding tokens that were added to bring 27 up to 32. So there would be 27 ones, followed by 5 zeroes ([1, 1, 1, … , 0, 0]).
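A minimal sketch of this padding-and-mask step (the pad token id of 0 and the target length of 32 are assumptions for illustration):

```python
import numpy as np

PAD_ID = 0          # assumed padding token id
MAX_LEN = 32        # padded sequence length from the example above

tokens = list(range(1, 28))  # a hypothetical 27-token sequence

# Pad the sequence up to MAX_LEN with the padding token.
padded = tokens + [PAD_ID] * (MAX_LEN - len(tokens))

# The mask is 1 for real tokens, 0 for padding positions.
mask = np.array([1 if t != PAD_ID else 0 for t in padded])

print(len(padded))    # 32
print(int(mask.sum()))  # 27 (ones), so 5 zeroes at the end
```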
When the model makes predictions (for 32 tokens), the per-token losses are multiplied by the mask, which sets the loss on padding tokens to 0 (the model is not penalized or rewarded for predicting padding tokens). This way the model only "trains" to correctly predict tokens that have a mask of 1.
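The masked-loss idea can be sketched like this (the per-token loss values here are made up for illustration; only the masking logic matters):

```python
import numpy as np

def masked_loss(per_token_loss, mask):
    """Zero out the loss at padding positions and average over real tokens only."""
    masked = per_token_loss * mask
    return masked.sum() / mask.sum()

# Hypothetical per-token losses: 27 real tokens, then 5 padding tokens
# whose (meaningless) losses should be ignored.
per_token_loss = np.array([0.5] * 27 + [9.9] * 5)
mask = np.array([1] * 27 + [0] * 5)

print(masked_loss(per_token_loss, mask))  # 0.5 - padding losses have no effect
```

Note that dividing by `mask.sum()` (the number of real tokens) rather than the full sequence length keeps the average loss comparable across sequences with different amounts of padding.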

Cheers

P.S. you might be interested in this post which explains the next_symbol function in more detail.