C3: W1: Lab1: Tokenizer number of words!

Why does the tokenizer create a dictionary with five words although num_words = 3?

Any ideas?


The word_index always keeps track of every word the tokenizer has seen; num_words does not limit it. The limit is only applied later, in texts_to_sequences and texts_to_matrix, where words whose index is num_words or higher are dropped. One use of keeping the full word_index is that you can call fit_on_texts multiple times to update the tokenizer and combine multiple data sources. Keep in mind that indices are assigned to words based on their counts: the word with the highest count gets index 1.

from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=3)
sentences = [
    'love my dog',
    'love my cat'
]
tokenizer.fit_on_texts(sentences)
print(tokenizer.word_index) # {'love': 1, 'my': 2, 'dog': 3, 'cat': 4}
more_sentences = [
    'I wrote I I I to mean 3'
]
tokenizer.fit_on_texts(more_sentences)
print(tokenizer.word_index) # {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5, 'wrote': 6, 'to': 7, 'mean': 8, '3': 9}
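To see where num_words actually matters, here is a minimal sketch: word_index keeps all four words, but texts_to_sequences with num_words=3 only emits indices strictly below 3, i.e. the top two words. (The sentences are the same as above; only the texts_to_sequences call is new.)

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=3)
tokenizer.fit_on_texts(['love my dog', 'love my cat'])

# word_index records every word seen, regardless of num_words
print(tokenizer.word_index)  # {'love': 1, 'my': 2, 'dog': 3, 'cat': 4}

# num_words takes effect here: indices >= num_words are dropped,
# so effectively only the top (num_words - 1) words survive
print(tokenizer.texts_to_sequences(['love my dog']))  # [[1, 2]]
```

Note the off-by-one: num_words=3 keeps indices 1 and 2, so 'dog' (index 3) is silently dropped from the sequence even though it is still in word_index.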