I am getting a word-count mismatch after tokenizing.
The `remove_stopwords` function is working correctly for the given sentence.
The output of `parse_data_from_file` also matches the expected output.
However, `fit_tokenizer` has been returning a vocabulary size of 29731 instead of 29714 (a difference of 17).
This in turn leads to a sequence size of 2441 instead of 2438.
I understand the tokenizer is the issue here. I know I can pass the `filters` argument, but choosing which characters to filter feels like a trial-and-error experiment. How do I identify and resolve this issue? Is there a better way?
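
For what it's worth, rather than guessing at `filters`, you can inspect `word_index` directly and look for tokens that still contain punctuation or other non-alphanumeric characters; those leftovers are usually where the extra vocabulary entries come from. Here is a minimal diagnostic sketch, assuming you are using `tf.keras`'s `Tokenizer` and that `sentences` stands in for whatever list of preprocessed strings your pipeline produces (the variable name is a placeholder):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Fit a plain Tokenizer (no oov_token, so it doesn't pollute the check)
# on the same preprocessed sentences you pass to fit_tokenizer.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)

# Any token containing a character the default filters did not strip
# (apostrophes, unicode punctuation, etc.) is a likely culprit.
suspects = [w for w in tokenizer.word_index if not w.isalnum()]
print(len(suspects))          # hopefully close to your difference of 17
print(sorted(suspects)[:40])  # eyeball the offending characters
```

Once you can see the actual offending characters, you know exactly what to add to `filters` (or what to clean upstream in `remove_stopwords`), instead of experimenting blindly.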