C3 W1 assignment: Vocabulary contains 29608 words instead of 29714

My fit_tokenizer() function seems to be returning the wrong number of words in the word_index.

len(word_index) is 29608 words instead of the expected 29714. I have verified that my remove_stopwords() is working correctly. Even if it were not, I should get more words rather than fewer. Is there some way to get the reference dict of words so that I can compare, or does anyone have any other ideas for debugging this?

Here you go:
word_index.js (516.3 KB)


@balaji.ambresh , once again, thank you so much for your extremely prompt and thorough help.

I diffed the keys and am quite confused now: the keys in the list that you so kindly sent include the stopwords. However, the input parameter sentences is assigned by the call to parse_data_from_file, which is specifically supposed to strip the stopwords. Shouldn’t the stopwords have been removed from the word_index?
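In case it helps anyone else, this is roughly how I did the comparison (a rough sketch; it assumes the attached word_index.js is a plain JSON dump of the reference dict, and that word_index is the dict returned by my fit_tokenizer()):

import json

# Load the reference vocabulary attached above (assumed to be plain JSON)
with open('word_index.js') as f:
    reference = json.load(f)

mine = set(word_index.keys())   # vocabulary from my fit_tokenizer()
ref = set(reference.keys())

print(sorted(ref - mine)[:20])  # words only in the reference vocabulary
print(sorted(mine - ref)[:20])  # words only in my vocabulary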

That is correct. We remove stopwords from sentences.

The TensorFlow Tokenizer handles this differently: by default it filters out special characters such as -, replacing them with the split character (a space) before splitting, so hyphenated words end up as separate tokens. Here’s an example:

# tokenizer here is the Tokenizer already fit on the training sentences
ids = tokenizer.texts_to_sequences(['high-definition tv a-la-carte'])[0]
print(ids)
print([tokenizer.index_word[id] for id in ids])

# output
# [9, 10, 2, 139, 140, 141]
# ['high', 'definition', 'tv', 'a', 'la', 'carte']

You can set the filters field in Tokenizer to customize the behavior.
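For example (a minimal sketch, not part of the assignment code), dropping - from the default filter string keeps hyphenated words as single tokens:

from tensorflow.keras.preprocessing.text import Tokenizer

# The default filters string is '!"#$%&()*+,-./:;<=>?@[\]^_`{|}~\t\n'.
# Removing '-' from it keeps hyphenated words intact instead of splitting them.
custom_filters = '!"#$%&()*+,./:;<=>?@[\\]^_`{|}~\t\n'
tokenizer = Tokenizer(oov_token="<OOV>", filters=custom_filters)
tokenizer.fit_on_texts(['high-definition tv a-la-carte'])
print(tokenizer.word_index)
# expected: {'<OOV>': 1, 'high-definition': 2, 'tv': 3, 'a-la-carte': 4}

Whether you actually want to do this depends on the assignment's expected vocabulary, so treat it as an illustration of the filters parameter rather than a fix.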

Thanks again.

So the stopwords that showed up in the file you sent were just remnants of hyphenated words that the tokenizer split apart. That makes it clear.