I believe there are some errors in mapping tokens to words and then to labels. To simplify the explanation, consider the first resume. The function clean_dataset processes the first resume, identifies 227 words by splitting on ' ', and sets up a mapping with 227 tags. However, in the function tokenize_and_align_labels, the tokenizer processes the raw text of the first resume directly, and printing word_idx shows 318 words. Clearly, the tokenizer's logic for identifying words is different from splitting on ' '. So the two stages identify words inconsistently while sharing the same tag matrix, which is an error that needs to be fixed.
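To illustrate the mismatch with a toy string (hypothetical example text, not from the dataset): splitting on ' ' keeps punctuation glued to the preceding word, while a pre-tokenizer that treats punctuation as separate words produces a longer word list. The regex below is only an illustration of punctuation-aware splitting, not the tokenizer's exact rules.

```python
import re

text = "Skills: Python, SQL, and Excel"

# Whitespace split (what clean_dataset appears to do): punctuation
# stays attached to the word before it, e.g. "Skills:" and "Python,".
ws_words = text.split(" ")

# Punctuation-aware split, mimicking how many tokenizers treat ',' and ':'
# as separate words before subword splitting (illustrative regex only).
pt_words = re.findall(r"\w+|[^\w\s]", text)

print(ws_words)  # ['Skills:', 'Python,', 'SQL,', 'and', 'Excel']        -> 5 words
print(pt_words)  # ['Skills', ':', 'Python', ',', 'SQL', ',', 'and', 'Excel'] -> 8 words
```

With only 5 word-level tags but 8 tokenizer-level words, any index-based alignment goes wrong, which matches the 227-vs-318 counts above.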
Sorry, I don't quite follow what you're describing.
Can you post an example?
I noticed that the tokenizer separates ',' (a comma) into its own unit, while in the tags array (and elsewhere) the comma and the word before it are counted as one word, so this mismatch must be what breaks the alignment.
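If that is the cause, one common fix (a sketch, assuming a Hugging Face fast tokenizer; `align_labels` and the sample ids are hypothetical) is to feed the tokenizer the same pre-split word list the tags were built from, via `is_split_into_words=True`, so that `encoding.word_ids()` indexes into the 227-word tag array. The alignment step then looks roughly like:

```python
def align_labels(word_ids, word_tags):
    """Map word-level tags onto subword tokens.

    word_ids  -- result of a HF fast tokenizer's encoding.word_ids():
                 one entry per token; None for special tokens ([CLS]/[SEP]),
                 otherwise the index of the source word.
    word_tags -- one label id per *word*, from the same split the tags use.
    """
    labels = []
    prev = None
    for wid in word_ids:
        if wid is None:
            labels.append(-100)            # special token: ignored by the loss
        elif wid != prev:
            labels.append(word_tags[wid])  # first subword carries the tag
        else:
            labels.append(-100)            # later subwords are masked out
        prev = wid
    return labels

# Hypothetical encoding of words = ["Skills", "Python", "SQL"], where the
# tokenizer was called as tokenizer(words, is_split_into_words=True) and
# "Skills" split into two subwords:
word_ids = [None, 0, 0, 1, 2, None]        # [CLS] Ski ##lls Python SQL [SEP]
word_tags = [3, 5, 5]                      # one tag per word
print(align_labels(word_ids, word_tags))   # [-100, 3, -100, 5, 5, -100]
```

The key point is that `word_ids()` only matches the tag array when the tokenizer receives the pre-split words, not the raw text.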