C3W1: fit_token

Hello,

I would love some help with this problem. Any help would be appreciated.

Thanks,
David

Failed test case: incorrect word_index when using sample sentences (showing changed values with respect to correct answer).
Expected:
{},
but got:
Item root['project'] removed from dictionary.
Item root['phone'] removed from dictionary.
Item root['dan'] removed from dictionary.
Item root['julia'] removed from dictionary.
Item root['about'] removed from dictionary.
Item root['team'] removed from dictionary.
Item root['discuss'] removed from dictionary.
Item root['recording'] removed from dictionary.
Item root['rushed'] removed from dictionary.
Item root['market'] removed from dictionary.
Item root['1960s'] removed from dictionary.
Item root['colchester'] removed from dictionary.
Item root['110m'] removed from dictionary.
Item root['still'] removed from dictionary…

Please click my name and message your notebook as an attachment.

Here’s a hint:
When num_words is set to an integer, it is responsible for limiting the number of terms considered when encoding a sentence.

Every invocation of Tokenizer.fit_on_texts updates the vocabulary from the sentences passed to it. When Tokenizer.texts_to_sequences is invoked, only the most frequent num_words - 1 words are encoded with their values from tokenizer.word_index.

When oov_token is specified, it is used to encode any word that didn’t make the top num_words - 2. The oov_token itself takes index 1, so it becomes part of the final encoding vocabulary in this case.

Example with oov_token:

texts = ["hello world again",
         "hello world again today"]
tokenizer = Tokenizer(num_words = 3, oov_token="OOVTOKEN")
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
print(sequences)
print(tokenizer.word_index)

output:

[[2, 1, 1], [2, 1, 1, 1]]
{'OOVTOKEN': 1, 'hello': 2, 'world': 3, 'again': 4, 'today': 5}

When no oov_token is specified, only the most frequent num_words - 1 words are encoded; the rest are simply dropped from the sequences.
Example:

texts = ["hello world again",
         "hello world again today"]
tokenizer = Tokenizer(num_words=3)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
print(sequences)
print(tokenizer.word_index)

output:

[[1, 2], [1, 2]]
{'hello': 1, 'world': 2, 'again': 3, 'today': 4}

Note: Id allocation of words starts from 1 when no oov_token is specified and from 2 when oov_token is specified (index 1 is reserved for the oov_token). The words hello, world and again have the same count across both sentences. Since hello was encountered before the other words, it wins the tie and gets the lowest available id (2 in the oov_token example, 1 without it). The same rule breaks the tie between world and again in the examples above.
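
If you want to check that ranking yourself, the Tokenizer also exposes word_counts, which stores the raw frequencies in the order the words were first encountered, so you can see that the tie really is broken by encounter order:

print(tokenizer.word_counts)
# e.g. OrderedDict([('hello', 2), ('world', 2), ('again', 2), ('today', 1)])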


Hi,

I have tried different combinations of the OOV token and num_words, but I didn’t understand it as well as you’ve explained it here. I am going to apply your advice and let you know what happens. Thanks a lot for your help!

Hi Balaji,

Thanks again for your help and that fantastic, under-the-hood explanation of the tokenizer. I really did not understand it so well before.

The problem was actually simpler than that, though: I did not actually fit the tokenizer in the fit_tokenizer function. But your help did come in handy, because I was getting the wrong number of words, which would have caused a problem eventually.
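
In case it helps anyone else who hits the same thing, here is a minimal sketch of the shape of the fix. The function signature and the oov_token value below are just assumptions for illustration; the assignment's actual parameters may differ.

from tensorflow.keras.preprocessing.text import Tokenizer

def fit_tokenizer(sentences):
    # Hypothetical signature: takes the training sentences, returns a fitted Tokenizer.
    tokenizer = Tokenizer(oov_token="<OOV>")
    tokenizer.fit_on_texts(sentences)  # this is the call I had left out
    return tokenizer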

Thanks again, super guy!!

Best,
David