Num_words arg in Tokenizer

I know this was explained in the lecture, but I didn't get it fully. Let's say num_words = 100. The Tokenizer documentation says that "the maximum number of words to keep" is chosen based on the highest frequency, yet when I call word_index I can see more than 100 tokens. Based on this, I assume there are two sets of tokens: one of length 100, containing the 100 most frequent words, and another containing the tokens for all the unique words in the training set, which is what the word_index attribute returns.

When we use the above tokenizer on a set of test_sentences with the texts_to_sequences method, does it use the set of 100 tokens to build the sequences, or the full set returned by word_index? I assume it should be the set of 100 tokens.
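For instance (a minimal sketch with a tiny made-up corpus instead of 100 words, assuming the tensorflow.keras.preprocessing.text import path), this is the kind of thing I am seeing:

>>> from tensorflow.keras.preprocessing.text import Tokenizer
>>> corpus = ['the cat sat on the mat', 'the dog ate my homework']
>>> tokenizer = Tokenizer(num_words=3)
>>> tokenizer.fit_on_texts(corpus)
>>> len(tokenizer.word_index)  # more unique words than num_words
9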

The num_words parameter keeps only the most frequent num_words - 1 words as the vocabulary when encoding words to integers. If an OOV token is specified, it occupies one of those slots, so only the most frequent num_words - 2 words are integer-encoded and the rest are treated as OOV.
In case of a tie, the word with the smaller index wins.
See the example below and notice that the words this, is, a and day all have the same frequency, i.e. 2. However, if you choose to keep only the top 2 frequent words, day and a will be left out since they have higher word indexes than this and is.

Consider the following example:

>>> sentences = ['This is a rainy day','This day is a windy']
>>> tokenizer = Tokenizer(num_words=4, oov_token='<OOV>')
>>> tokenizer.fit_on_texts(sentences)
>>> tokenizer.word_counts
OrderedDict([('this', 2), ('is', 2), ('a', 2), ('rainy', 1), ('day', 2), ('windy', 1)])
>>> tokenizer.word_index
{'<OOV>': 1, 'this': 2, 'is': 3, 'a': 4, 'day': 5, 'rainy': 6, 'windy': 7}
>>> tokenizer.texts_to_sequences(sentences)
[[2, 3, 1, 1, 1], [2, 1, 3, 1, 1]]
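Here num_words=4 with an OOV token means only indexes strictly below 4 survive encoding: '<OOV>' (1), 'this' (2) and 'is' (3). Every other word ('a', 'day', 'rainy', 'windy') is replaced by the OOV index 1, which is why 'This is a rainy day' becomes [2, 3, 1, 1, 1].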

Attributes such as word_index and word_counts hold details about all the words encountered so far. This becomes useful when you want to fit on multiple sources of text. Here's an example:

>>> dataset1 = [
...     'This is a short text',
... ]
>>> dataset2 = [
...     'This is a longer text with more text tokens'
... ]
>>> tokenizer = Tokenizer(num_words=4, oov_token='<OOV>')
>>> tokenizer.fit_on_texts(dataset1)
>>> tokenizer.texts_to_sequences(dataset1 + dataset2)
[[2, 3, 1, 1, 1], [2, 3, 1, 1, 1, 1, 1, 1, 1]]
>>> tokenizer.word_index
{'<OOV>': 1, 'this': 2, 'is': 3, 'a': 4, 'short': 5, 'text': 6}
>>> tokenizer.word_counts
OrderedDict([('this', 1), ('is', 1), ('a', 1), ('short', 1), ('text', 1)])
>>> tokenizer.fit_on_texts(dataset2)
>>> tokenizer.word_index
{'<OOV>': 1, 'text': 2, 'this': 3, 'is': 4, 'a': 5, 'short': 6, 'longer': 7, 'with': 8, 'more': 9, 'tokens': 10}
>>> tokenizer.texts_to_sequences(dataset1 + dataset2)
[[3, 1, 1, 1, 2], [3, 1, 1, 1, 2, 1, 1, 2, 1]]
>>> tokenizer.word_counts
OrderedDict([('this', 2), ('is', 2), ('a', 2), ('short', 1), ('text', 3), ('longer', 1), ('with', 1), ('more', 1), ('tokens', 1)])

Notice how the word text becomes the most frequent word since it's encountered 3 times across the two calls to fit_on_texts, which is why it now gets index 2 and is encoded instead of being mapped to OOV.
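So, to answer the original question directly: texts_to_sequences only emits indexes below num_words (the OOV index plus the top num_words - 2 words when an OOV token is set), not everything in word_index. A minimal sketch reusing the first example's sentences (the test sentence and the word 'sunny' are made up here and were never fitted):

>>> sentences = ['This is a rainy day', 'This day is a windy']
>>> test_sentences = ['This day is sunny']
>>> tokenizer = Tokenizer(num_words=4, oov_token='<OOV>')
>>> tokenizer.fit_on_texts(sentences)
>>> tokenizer.texts_to_sequences(test_sentences)
[[2, 1, 3, 1]]

Only indexes 1, 2 and 3 show up: 'day' is in word_index but its index (5) is not below num_words, so it maps to OOV, and 'sunny', which was never seen during fitting, maps to OOV as well.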