Hello everyone,
While diving into the 2nd ungraded lab on transformers for NER, I didn't quite understand this remark in the notebook:
" Transformer models are often trained by tokenizers that split words into subwords. For instance, the word ‘Africa’ might get split into multiple subtokens. This can create some misalignment between the list of tags for the dataset and the list of labels generated by the tokenizer, since the tokenizer can split one word into several, or add special tokens. Before processing, it is important that you align the lists of tags and the list of labels generated by the selected tokenizer with a tokenize_and_align_labels()
function."
And what does the tokenize_and_align_labels() function do differently from the standard tokenization that gave us the tags variable?
Thanks in advance.