C3-W1 Count Errors

I am getting an output count error after tokenizing.

  1. remove_stopwords is working correctly for the given sentence.
  2. The output of parse_data_from_file also matches the expected output.
  3. fit_tokenizer is returning 29731 instead of the expected 29714 (a difference of 17).
  4. This leads to a sequence size of 2441 instead of 2438.

I understand the tokenizer is the issue here. I know I can use filters, but choosing which ones feels like a trial-and-error experiment. How do I identify and resolve this issue? Is there a better way?
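One way to identify where the extra tokens come from is to inspect the tokenizer's word_index directly. Below is a minimal diagnostic sketch, not the assignment's code; it assumes the standard tensorflow.keras Tokenizer and that `sentences` (the preprocessed text) and `stopwords` (the list given in the notebook) are already defined:

```python
# Minimal diagnostic sketch (not the graded solution). Assumes the standard
# Keras Tokenizer and that `sentences` and `stopwords` already exist.
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)

vocab = tokenizer.word_index            # word -> index mapping
print("vocabulary size:", len(vocab))   # e.g. 29731 vs the expected 29714

# Tokens that should not have survived preprocessing: any stopword still in
# the vocabulary points back at remove_stopwords, not the Tokenizer filters.
leftover = [w for w in stopwords if w in vocab]
print("stopwords still in vocab:", leftover)
```

If any stopwords show up in that list, the mismatch comes from the preprocessing feeding the tokenizer rather than from the Tokenizer's filters.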

Please click my name and message your notebook as an attachment.

Your def remove_stopwords(sentence) is incorrect. These are the steps (a rough sketch follows the list):

  1. Convert the sentence to lowercase.
  2. Split the sentence by whitespace. This gives you a list of words.
  3. Create a new list containing the words in the sentence that are not stopwords.
  4. Use the string join method to build the new sentence without any stopwords.
  5. Return the string.
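As a reference only (not the graded solution), a minimal sketch of those steps might look like this, assuming a `stopwords` list is already defined earlier in the notebook:

```python
def remove_stopwords(sentence):
    """Sketch of the steps above; assumes a `stopwords` list is already defined."""
    # 1. Convert the sentence to lowercase.
    sentence = sentence.lower()
    # 2. Split by whitespace to get a list of words.
    words = sentence.split()
    # 3. Keep only the words that are not stopwords.
    filtered = [word for word in words if word not in stopwords]
    # 4. Join the remaining words back into a single string.
    sentence = " ".join(filtered)
    # 5. Return the string.
    return sentence
```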

Simply excellent. I was looking in the wrong place. Your suggestion not only worked, it also simplified my code.

Thank you so much, Balaji.