C3W2 Lab1: vocab_size and reverse_word_index length


I have some doubt about how vocab_size works:

As we can see, we have this scenario:

vocab_size = 10000
embedding_weights = 10000
reverse_word_index = 88583

In reverse_word_index there are more words than vocab_size. reverse_word_index is defined as:

reverse_word_index = tokenizer.index_word

The Tokenizer is fitted on the corpus of training sentences, so there is nothing strange about the corpus containing more words than the “num_words” parameter allows… in this case 88583 vs 10000.

My question is about how the words are chosen for inclusion in the embedding, since 88583 - 10000 of them are not included.

Hello @fdam ,

Send me your notebook via DM so that I can check where it went wrong. By clicking on the profile picture, you will see an option to message. There you can attach your notebook. Then we can discuss the issues here, under the topic you created.

With regards,
Nilosree Sengupta

Hi @nilosreesengupta and thank you for your reply.

C3_W2_Lab_1_imdb.ipynb (18.3 KB)

I attached the notebook, but actually there is nothing wrong: it seems to work fine.

I only have some doubts about the difference between the set vocab_size (10000) and the length of reverse_word_index (88583), and about how the 10000 words are picked so that they match the embedding_weights (with length 10000).

Thank you for your help.

Hello @fdam,

vocab_size :- the number of most frequent unique words kept from the corpus (the “num_words” parameter).
vocabulary :- those most frequent words; every remaining token is mapped to the OOV token. Only their weights → embedding weights.
reverse_word_index :- tokenizer.index_word, which contains every unique word seen during fitting, plus the OOV token, regardless of “num_words” → vocabulary + oov_token + all remaining words.
So reverse_word_index is larger than vocab_size.

For selection, the Tokenizer assigns indices by frequency: the most frequent word gets the lowest index. When encoding, only words whose index is below “num_words” are kept; everything else becomes the OOV token, so the embedding only ever sees the top vocab_size words.
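To make the selection rule concrete, here is a minimal pure-Python sketch of the same frequency-based indexing (not the actual Keras implementation; the tiny corpus and num_words value are made up for illustration):

```python
from collections import Counter

sentences = [
    "the movie was great",
    "the movie was bad",
    "the plot was great",
]
num_words = 4  # like Keras' num_words: only indices 1..num_words-1 are kept

# Count word frequencies across the whole corpus
counts = Counter(w for s in sentences for w in s.split())

# Index words by frequency: index 1 = most frequent (index 0 is reserved)
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

# word_index holds EVERY unique word, regardless of num_words --
# this is why len(tokenizer.index_word) can be 88583 while vocab_size is 10000
print(len(word_index))  # 6 unique words here, even though num_words = 4

# Only words with an index below num_words survive the encoding step
def texts_to_sequences(texts):
    return [[word_index[w] for w in t.split() if word_index[w] < num_words]
            for t in texts]

print(texts_to_sequences(["the movie plot was bad"]))  # rare words dropped
```

So nothing "chooses" the 10000 embedded words explicitly: they are simply the words that received the lowest indices during fitting, i.e. the most frequent ones.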

Hope this helps.

With regards,
Nilosree Sengupta
