I am getting a word-count mismatch after tokenizing.
The `remove_stopwords` function is working correctly for the given sentence.
The output of `parse_data_from_file` also matches the expected output.
However, `fit_tokenizer` has been returning a vocabulary size of 29731 instead of 29714 (a difference of 17).
This in turn leads to a sequence size of 2441 instead of 2438.
I understand the tokenizer is the issue here. I know I can pass the `filters` argument, but choosing which characters to filter feels like a trial-and-error experiment. How do I identify and resolve this issue? Is there a better way?
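
For what it's worth, rather than guessing at `filters`, you can inspect `word_index` directly and look for tokens that still contain punctuation or other non-alphanumeric characters; those leftovers are usually where the extra vocabulary entries come from. Here is a minimal diagnostic sketch, assuming you are using `tf.keras`'s `Tokenizer` and that `sentences` stands in for whatever list of preprocessed strings your pipeline produces (the variable name is a placeholder):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Fit a plain Tokenizer (no oov_token, so it doesn't pollute the check)
# on the same preprocessed sentences you pass to fit_tokenizer.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)

# Any token containing a character the default filters did not strip
# (apostrophes, unicode punctuation, etc.) is a likely culprit.
suspects = [w for w in tokenizer.word_index if not w.isalnum()]
print(len(suspects))          # hopefully close to your difference of 17
print(sorted(suspects)[:40])  # eyeball the offending characters
```

Once you can see the actual offending characters, you know exactly what to add to `filters` (or what to clean upstream in `remove_stopwords`), instead of experimenting blindly.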