My fit_tokenizer() function seems to be returning the wrong number of words in the word_index.
len(word_index) is 29608 instead of the expected 29714. I have verified that my remove_stopwords()
is working correctly, and even if it were not, I should be getting more words, not fewer. Is there some way to get the reference dict of words so that I can compare against it, or does anyone have other ideas for debugging this?
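For context, here is a minimal sketch of the kind of fit_tokenizer() involved here, assuming the standard Keras Tokenizer; the exact settings (in particular the oov_token) are assumptions, not necessarily the graded code:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

def fit_tokenizer(sentences):
    # Fit a plain Keras Tokenizer on the (stopword-stripped) sentences.
    # Note: passing oov_token adds one extra "<OOV>" entry to word_index,
    # so it changes len(word_index) by exactly one.
    tokenizer = Tokenizer(oov_token="<OOV>")
    tokenizer.fit_on_texts(sentences)
    return tokenizer
```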
@balaji.ambresh, once again, thank you so much for your extremely prompt and thorough help.
I diffed the keys and am quite confused now: the keys in the list that you so kindly sent include the stopwords. However, the input parameter sentences is assigned by the call to parse_data_from_file(), which is specifically supposed to strip the stopwords. Shouldn't the stopwords have been removed from the word_index?
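In case it helps anyone else debugging the same mismatch, the diff itself is just plain set operations; reference_words.txt here is a hypothetical one-word-per-line dump of the reference keys shared above, and tokenizer is the object returned by fit_tokenizer(sentences):

```python
# Compare my word_index keys against the reference key list.
with open("reference_words.txt") as f:
    reference = {line.strip() for line in f if line.strip()}

mine = set(tokenizer.word_index)
print("in reference but not in mine:", sorted(reference - mine)[:20])
print("in mine but not in reference:", sorted(mine - reference)[:20])
```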
So the stopwords included in the file you sent are just remnants of hyphenated words that the tokenizer split apart. That clears it up.
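A tiny demonstration of what I believe is happening: the Keras Tokenizer's default filters string treats '-' as a separator, so a stopword hidden inside a hyphenated token survives stopword removal and then reappears as its own word_index entry:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# After stopword removal the standalone "of" is gone, but it is still
# embedded inside the hyphenated token.
sentences = ["word-of-mouth marketing works"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
print(tokenizer.word_index)
# The default `filters` string contains '-', so "word-of-mouth" is split into
# "word", "of", "mouth", and the stopword "of" lands in word_index anyway.
```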