C3: W1: Lab1: Tokenizer number of words!

Why does the tokenizer create a dictionary with five words although num_words = 3?

Any ideas?


The word_index always keeps track of every word the tokenizer has seen; num_words does not limit it. The limit is only applied later, in texts_to_sequences and texts_to_matrix, where words whose index is num_words or higher are dropped. One use of keeping the full word_index is that you can call fit_on_texts multiple times to update the tokenizer and combine multiple data sources. Keep in mind that indices are assigned to words based on their counts: the word with the highest count gets index 1.

from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=3)
sentences = [
    'love my dog',
    'love my cat'
]
tokenizer.fit_on_texts(sentences)
print(tokenizer.word_index) # {'love': 1, 'my': 2, 'dog': 3, 'cat': 4}
more_sentences = [
    'I wrote I I I to mean 3'
]
tokenizer.fit_on_texts(more_sentences)
print(tokenizer.word_index) # {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5, 'wrote': 6, 'to': 7, 'mean': 8, '3': 9}
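To see where num_words actually matters, here is a minimal sketch: word_index keeps all four words, but texts_to_sequences with num_words=3 only emits indices strictly below 3, i.e. the top two words. (The sentences are the same as above; only the texts_to_sequences call is new.)

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=3)
tokenizer.fit_on_texts(['love my dog', 'love my cat'])

# word_index records every word seen, regardless of num_words
print(tokenizer.word_index)  # {'love': 1, 'my': 2, 'dog': 3, 'cat': 4}

# num_words takes effect here: indices >= num_words are dropped,
# so effectively only the top (num_words - 1) words survive
print(tokenizer.texts_to_sequences(['love my dog']))  # [[1, 2]]
```

Note the off-by-one: num_words=3 keeps indices 1 and 2, so 'dog' (index 3) is silently dropped from the sequence even though it is still in word_index.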