Transformer Pre-processing Max Sequence Length

In the optional Transformer Preprocessing Lab, I came across this:

Define the embedding dimension as 100. This value must match the dimensionality of the word embedding. In the “Attention is All You Need” paper, embedding sizes range from 100 to 1024, depending on the task. The authors also use a maximum sequence length ranging from 40 to 512 depending on the task. Define the maximum sequence length to be 100, and the maximum number of words to be 64.

Question: what is the difference between the maximum sequence length and the maximum number of words? Isn't a sequence made up of words (so if the maximum number of words is 64, wouldn't the maximum sequence length also be 64)? Why define both?


Here, the "maximum number of words" is the vocabulary size of the tokenizer: when the vocabulary is built, only the most frequent words (up to that limit) are retained, and all other words are ignored. In other words, the tokenizer only recognizes those words. The maximum sequence length is a separate limit: it caps how many tokens a single input sequence may contain, with longer sequences truncated and shorter ones padded to that length. So one value bounds how many distinct words the model knows, while the other bounds how many tokens appear in any one input, which is why both are defined.
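A minimal sketch of the two limits in plain Python (not the lab's actual tokenizer; the helper names `build_vocab` and `texts_to_padded_sequences` are made up for illustration). The word limit acts when the vocabulary is built; the sequence-length limit acts on each individual input:

```python
from collections import Counter

def build_vocab(texts, max_num_words):
    """Keep only the max_num_words most frequent words (the "maximum
    number of words" limit). Index 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = [w for w, _ in counts.most_common(max_num_words)]
    return {word: i + 1 for i, word in enumerate(most_common)}

def texts_to_padded_sequences(texts, vocab, max_seq_len):
    """Convert each text to integer IDs, dropping words the vocabulary
    does not recognize, then truncate/pad to exactly max_seq_len tokens
    (the "maximum sequence length" limit)."""
    sequences = []
    for text in texts:
        seq = [vocab[w] for w in text.split() if w in vocab]
        seq = seq[:max_seq_len]                     # truncate long inputs
        seq = seq + [0] * (max_seq_len - len(seq))  # pad short ones with 0
        sequences.append(seq)
    return sequences

texts = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocab(texts, max_num_words=4)    # only 4 distinct words kept
padded = texts_to_padded_sequences(texts, vocab, max_seq_len=5)
print(len(vocab))                # -> 4: vocabulary size is capped
print([len(s) for s in padded])  # -> [5, 5]: every input is 5 tokens long
```

Note that the two numbers are independent: here the vocabulary holds 4 distinct words, yet a single sequence is 5 tokens long (words may repeat within an input, and unrecognized words like "mat" and "dog" are simply dropped).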