I believe there are some errors in mapping tokens to words and then to labels. To simplify the explanation, consider the first resume. The function clean_dataset processes the first resume, identifies 227 words by splitting on ' ', and sets up a mapping with 227 tags. However, in the function tokenize_and_align_labels, the tokenizer processes the raw text of the first resume directly, and printing word_idx shows 318 words. Clearly, the tokenizer's logic for identifying words is different from splitting on ' '. So the two stages identify words inconsistently while sharing the same tag matrix, which is an error that needs to be fixed.
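To illustrate the mismatch with a toy string (hypothetical example text, not from the dataset): splitting on ' ' keeps punctuation glued to the preceding word, while a pre-tokenizer that treats punctuation as separate words produces a longer word list. The regex below is only an illustration of punctuation-aware splitting, not the tokenizer's exact rules.

```python
import re

text = "Skills: Python, SQL, and Excel"

# Whitespace split (what clean_dataset appears to do): punctuation
# stays attached to the word before it, e.g. "Skills:" and "Python,".
ws_words = text.split(" ")

# Punctuation-aware split, mimicking how many tokenizers treat ',' and ':'
# as separate words before subword splitting (illustrative regex only).
pt_words = re.findall(r"\w+|[^\w\s]", text)

print(ws_words)  # ['Skills:', 'Python,', 'SQL,', 'and', 'Excel']        -> 5 words
print(pt_words)  # ['Skills', ':', 'Python', ',', 'SQL', ',', 'and', 'Excel'] -> 8 words
```

With only 5 word-level tags but 8 tokenizer-level words, any index-based alignment goes wrong, which matches the 227-vs-318 counts above.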
Sorry, I don't quite follow what you're describing.
Can you post an example?
I noticed that the tokenizer separates ',' (a comma) into its own unit, while in the tags array (and elsewhere) the comma and the word before it are counted as one word, so this mismatch must be what breaks the alignment.
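If that is the cause, one common fix (a sketch, assuming a Hugging Face fast tokenizer; `align_labels` and the sample ids are hypothetical) is to feed the tokenizer the same pre-split word list the tags were built from, via `is_split_into_words=True`, so that `encoding.word_ids()` indexes into the 227-word tag array. The alignment step then looks roughly like:

```python
def align_labels(word_ids, word_tags):
    """Map word-level tags onto subword tokens.

    word_ids  -- result of a HF fast tokenizer's encoding.word_ids():
                 one entry per token; None for special tokens ([CLS]/[SEP]),
                 otherwise the index of the source word.
    word_tags -- one label id per *word*, from the same split the tags use.
    """
    labels = []
    prev = None
    for wid in word_ids:
        if wid is None:
            labels.append(-100)            # special token: ignored by the loss
        elif wid != prev:
            labels.append(word_tags[wid])  # first subword carries the tag
        else:
            labels.append(-100)            # later subwords are masked out
        prev = wid
    return labels

# Hypothetical encoding of words = ["Skills", "Python", "SQL"], where the
# tokenizer was called as tokenizer(words, is_split_into_words=True) and
# "Skills" split into two subwords:
word_ids = [None, 0, 0, 1, 2, None]        # [CLS] Ski ##lls Python SQL [SEP]
word_tags = [3, 5, 5]                      # one tag per word
print(align_labels(word_ids, word_tags))   # [-100, 3, -100, 5, 5, -100]
```

The key point is that `word_ids()` only matches the tag array when the tokenizer receives the pre-split words, not the raw text.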