C3W2 Lab1: vocab_size and reverse_word_index length

fdam · April 30, 2023, 3:18pm

Hi,

I have some doubts about how vocab_size works.

As we can see, we have this scenario:

vocab_size = 10000
len(embedding_weights) = 10000
len(reverse_word_index) = 88583

In reverse_word_index there are more words than vocab_size. reverse_word_index is defined as:

reverse_word_index = tokenizer.index_word

The tokenizer is fitted on the corpus of training sentences, so it is not strange that the corpus contains more words than the num_words parameter: in this case 88583 vs 10000.

My question is about how the words are chosen for inclusion in the embedding, given that 88583 - 10000 of them are not included.

nilosreesengupta · April 30, 2023, 5:24pm

Hello @fdam ,

Send me your notebook via DM so that I can check where it went wrong. If you click on my profile picture, you will see an option to message me, and you can attach your notebook there. Then we can discuss the issues here, under the topic you created.

With regards,
Nilosree Sengupta

fdam · May 2, 2023, 8:21am

Hi @nilosreesengupta and thank you for your reply.

C3_W2_Lab_1_imdb.ipynb (18.3 KB)

I attached the notebook, but actually there is nothing wrong: it seems to work fine.

I only have some doubts about the difference between the vocab_size that was set (10000) and the length of reverse_word_index (88583), and about how the 10000 words are picked to match the embedding_weights (with length 10000).

Thank you for your help.

nilosreesengupta · May 7, 2023, 8:45pm

Hello @fdam,

vocab_size :- the number of most frequent unique words from the corpus that are kept (passed to the Tokenizer as num_words).
vocabulary :- those most frequent words; every remaining word is replaced by the OOV token when sentences are converted to sequences. The embedding weights hold one row per index in this vocabulary.
reverse_word_index :- consists of all unique words from the corpus plus the OOV token, because tokenizer.index_word is not truncated by num_words.
So the length of reverse_word_index is different from, and greater than, vocab_size.

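For selection, the Tokenizer ranks the words by how often they occur in the corpus it was fitted on and keeps the most frequent ones, so lower indices belong to more frequent words.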
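
As a rough illustration, here is a small pure-Python sketch that imitates the behaviour described above. It is not the Tokenizer's actual code: the helper names build_word_index and to_sequence, the toy corpus, and num_words = 5 are all made up for this example, and it assumes plain whitespace splitting with index 1 reserved for the OOV token.

```python
from collections import Counter

def build_word_index(sentences, oov_token="<OOV>"):
    """Rank every word by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order.
    ranked = [oov_token] + [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate(ranked, start=1)}

def to_sequence(sentence, word_index, num_words, oov_token="<OOV>"):
    """Keep only indices below num_words; anything else becomes the OOV index."""
    oov = word_index[oov_token]
    seq = []
    for word in sentence.lower().split():
        idx = word_index.get(word)
        seq.append(idx if idx is not None and idx < num_words else oov)
    return seq

# Hypothetical toy corpus standing in for the training sentences.
corpus = ["the movie was good", "the movie was bad", "the plot was thin"]
word_index = build_word_index(corpus)
reverse_word_index = {i: w for w, i in word_index.items()}

num_words = 5  # plays the role of vocab_size in this sketch
seq = to_sequence("the movie was good", word_index, num_words)

print(len(reverse_word_index))  # every unique word plus the OOV token
print(seq)                      # rare words collapse to the OOV index
```

The full index keeps every word, while the sequences only ever contain indices below num_words. That is why an embedding with num_words rows is enough even though reverse_word_index is much longer.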

Hope this helps.

With regards,
Nilosree Sengupta
