Hello everyone,
While diving into the 2nd ungraded lab on transformers for NER, I didn't quite understand this remark in the notebook:
" Transformer models are often trained by tokenizers that split words into subwords. For instance, the word ‘Africa’ might get split into multiple subtokens. This can create some misalignment between the list of tags for the dataset and the list of labels generated by the tokenizer, since the tokenizer can split one word into several, or add special tokens. Before processing, it is important that you align the lists of tags and the list of labels generated by the selected tokenizer with a tokenize_and_align_labels()
function."
And what does the tokenize_and_align_labels() function do differently from the standard tokenization that gave us the tags variable?
Thanks in advance.