C3 W1 assignment: Vocabulary contains 29608 words instead of 29714

My fit_tokenizer() function seems to be returning the wrong number of words in the word_index.

len(word_index) is 29608 words instead of the expected 29714. I have verified that my remove_stopwords() is working correctly. Even if it were not, I should get more words rather than fewer. Is there some way to get the reference dict of words so that I can compare, or does anyone have any other ideas for debugging this?

Here you go:
word_index.js (516.3 KB)


@balaji.ambresh , once again, thank you so much for your extremely prompt and thorough help.

I diffed the keys and am quite confused now: the keys in the list that you so kindly sent include the stopwords. However, the input parameter sentences is assigned by the call to parse_data_from_file, which is specifically supposed to strip the stopwords. Shouldn’t the stopwords have been removed from the word_index?
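In case it helps anyone else, this is roughly how I did the comparison (a rough sketch; it assumes the attached word_index.js is a plain JSON dump of the reference dict, and that word_index is the dict returned by my fit_tokenizer()):

import json

# Load the reference vocabulary attached above (assumed to be plain JSON)
with open('word_index.js') as f:
    reference = json.load(f)

mine = set(word_index.keys())   # vocabulary from my fit_tokenizer()
ref = set(reference.keys())

print(sorted(ref - mine)[:20])  # words only in the reference vocabulary
print(sorted(mine - ref)[:20])  # words only in my vocabulary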

That is correct. We remove stopwords from sentences.

The TensorFlow Tokenizer handles this differently: by default it filters out special characters such as -, replacing them with the split character (a space) before splitting, so hyphenated words end up as separate tokens. Here’s an example:

# tokenizer here is the Tokenizer already fit on the training sentences
ids = tokenizer.texts_to_sequences(['high-definition tv a-la-carte'])[0]
print(ids)
print([tokenizer.index_word[id] for id in ids])

# output
# [9, 10, 2, 139, 140, 141]
# ['high', 'definition', 'tv', 'a', 'la', 'carte']

You can set the filters field in Tokenizer to customize the behavior.
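For example (a minimal sketch, not part of the assignment code), dropping - from the default filter string keeps hyphenated words as single tokens:

from tensorflow.keras.preprocessing.text import Tokenizer

# The default filters string is '!"#$%&()*+,-./:;<=>?@[\]^_`{|}~\t\n'.
# Removing '-' from it keeps hyphenated words intact instead of splitting them.
custom_filters = '!"#$%&()*+,./:;<=>?@[\\]^_`{|}~\t\n'
tokenizer = Tokenizer(oov_token="<OOV>", filters=custom_filters)
tokenizer.fit_on_texts(['high-definition tv a-la-carte'])
print(tokenizer.word_index)
# expected: {'<OOV>': 1, 'high-definition': 2, 'tv': 3, 'a-la-carte': 4}

Whether you actually want to do this depends on the assignment's expected vocabulary, so treat it as an illustration of the filters parameter rather than a fix.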

Thanks again.

So the stopwords that showed up in the file you sent were just remnants of hyphenated words that the tokenizer split apart. That makes it clear.