C5W4 : Ungraded Lab : NER with transformers

Hello everyone,
While diving into the 2nd ungraded lab on NER with transformers, I didn’t quite understand this remark in the notebook:
" Transformer models are often trained by tokenizers that split words into subwords. For instance, the word ‘Africa’ might get split into multiple subtokens. This can create some misalignment between the list of tags for the dataset and the list of labels generated by the tokenizer, since the tokenizer can split one word into several, or add special tokens. Before processing, it is important that you align the lists of tags and the list of labels generated by the selected tokenizer with a tokenize_and_align_labels() function."

Also, what does the tokenize_and_align_labels() function do differently from the standard tokenization that gave us the tags variable?
Thanks in advance.

Subword tokenizers divide a word into smaller pieces so it can be represented with entries already in the vocabulary. This way, instead of encoding an unknown word with the OOV token, the original word is split into smaller subtokens. NER datasets, however, assign one entity tag per token at the word level, so after subword tokenization (and after special tokens like [CLS] and [SEP] are added) the tag list and the token list no longer line up. The tokenize_and_align_labels function fixes this: it assigns the original word’s tag to the first subtoken of that word, and assigns -100 (which the loss function ignores, i.e. “don’t care”) to the remaining subtokens and to the special tokens.
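
Here’s a minimal sketch of what such a function can look like, assuming a Hugging Face fast tokenizer and a dataset whose columns are named tokens and ner_tags (the checkpoint and column names are my assumptions; the lab’s actual implementation may differ):

```python
from transformers import AutoTokenizer

# Any fast (Rust-backed) tokenizer exposes word_ids(); the checkpoint
# name here is an assumption, not necessarily the one used in the lab.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_and_align_labels(examples):
    # Tokenize pre-split words; one word may become several subtokens.
    tokenized = tokenizer(examples["tokens"], truncation=True,
                          is_split_into_words=True)
    all_labels = []
    for i, word_labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)  # subtoken -> word index
        previous_word = None
        aligned = []
        for word_id in word_ids:
            if word_id is None:
                aligned.append(-100)                  # special tokens ([CLS], [SEP])
            elif word_id != previous_word:
                aligned.append(word_labels[word_id])  # first subtoken keeps the tag
            else:
                aligned.append(-100)                  # later subtokens: "don't care"
            previous_word = word_id
        all_labels.append(aligned)
    tokenized["labels"] = all_labels
    return tokenized
```

So if the tokenizer splits ‘Africa’ into two subtokens, only the first keeps the word’s tag; the second gets -100, which the cross-entropy loss ignores by default, so the extra subtoken positions never contribute to training.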