C3W2 Lab1: vocab_size and reverse_word_index length


I have some doubt about how vocab_size works:

As we can see, we have this scenario:

vocab_size = 10000
embedding_weights = 10000
reverse_word_index = 88583

In reverse_word_index there are more words than vocab_size. reverse_word_index is defined as:

reverse_word_index = tokenizer.index_word

The Tokenizer is fitted on the corpus of training sentences, so there is nothing strange about the corpus containing more words than the “num_words” parameter allows… in this case 88583 vs 10000.

My question is about how the words are chosen for inclusion in the embedding, since 88583 - 10000 of them are not included.

Hello @fdam ,

Send me your notebook via DM so that I can check where it went wrong. By clicking on the profile picture, you will see an option to message. There you can attach your notebook. Then we can discuss the issues here, under the topic you created.

With regards,
Nilosree Sengupta

Hi @nilosreesengupta and thank you for your reply.

C3_W2_Lab_1_imdb.ipynb (18.3 KB)

I attached the notebook, but actually there is nothing wrong: it seems to work fine.

I only have some doubts about the difference between the set vocab_size (10000) and the length of reverse_word_index (88583), and about how the 10000 words are picked so that they match the embedding_weights (with length 10000).

Thank you for your help.

Hello @fdam,

vocab_size :- the number of most frequent unique words kept from the corpus (the “num_words” parameter).
vocabulary :- those most frequent words; every remaining token is mapped to the OOV token. Only their weights → embedding weights.
reverse_word_index :- tokenizer.index_word, which contains every unique word seen during fitting, plus the OOV token, regardless of “num_words” → vocabulary + oov_token + all remaining words.
So reverse_word_index is larger than vocab_size.

For selection, the Tokenizer assigns indices by frequency: the most frequent word gets the lowest index. When encoding, only words whose index is below “num_words” are kept; everything else becomes the OOV token, so the embedding only ever sees the top vocab_size words.
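To make the selection rule concrete, here is a minimal pure-Python sketch of the same frequency-based indexing (not the actual Keras implementation; the tiny corpus and num_words value are made up for illustration):

```python
from collections import Counter

sentences = [
    "the movie was great",
    "the movie was bad",
    "the plot was great",
]
num_words = 4  # like Keras' num_words: only indices 1..num_words-1 are kept

# Count word frequencies across the whole corpus
counts = Counter(w for s in sentences for w in s.split())

# Index words by frequency: index 1 = most frequent (index 0 is reserved)
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

# word_index holds EVERY unique word, regardless of num_words --
# this is why len(tokenizer.index_word) can be 88583 while vocab_size is 10000
print(len(word_index))  # 6 unique words here, even though num_words = 4

# Only words with an index below num_words survive the encoding step
def texts_to_sequences(texts):
    return [[word_index[w] for w in t.split() if word_index[w] < num_words]
            for t in texts]

print(texts_to_sequences(["the movie plot was bad"]))  # rare words dropped
```

So nothing "chooses" the 10000 embedded words explicitly: they are simply the words that received the lowest indices during fitting, i.e. the most frequent ones.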

Hope this helps.

With regards,
Nilosree Sengupta
