C3W1-Assignment -> too many words in vocab and wrong shape

Hello everyone,
I got stuck on the first assignment. The parse_data_from_file function works as expected, but fit_tokenizer gave me this output:

Vocabulary contains 30888 words

<OOV> token included in vocabulary

Inside the function the Tokenizer gets only the OOV parameter.
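
For readers following along, here is a rough sketch of how a vocabulary with an OOV entry can be counted. It is a simplified stand-in written in plain Python, not the Keras Tokenizer itself: it only lowercases and splits on whitespace, and the helper name build_word_index and the sample texts are made up for illustration. It assumes, as Keras does, that the OOV token takes index 1 and the remaining words follow by frequency.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Simplified, hypothetical word index: OOV first, then words by frequency."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    word_index = {oov_token: 1}  # assumption: the OOV token is reserved at index 1
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

# Made-up sample sentences, not the assignment data.
texts = ["the cat sat", "the dog sat down"]
word_index = build_word_index(texts)
vocab_size = len(word_index)  # counts the OOV entry as well
```

The vocabulary size depends entirely on the sentences passed in, so an unexpected count usually points at the data fed to the tokenizer rather than at the tokenizer call.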

As a consequence, the next function, get_padded_sequences, gave me this output:

First padded sequence looks like this:

[ 0 0 0 … 87 87 8]

Numpy array of all sequences has shape: (2225, 2439)

This means there are 2225 sequences in total and each one has a size of 2439

There is a mismatch both in the numbers of the sequence and in the size of the array (2439 instead of the expected 2438). Even if I try post padding, the numbers are not correct and, as expected, the shape does not change.
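
A small sketch of why the padding mode does not affect the shape: the width of the padded array is set by the longest sequence, and pre or post padding only decides on which side the zeros go. This is a plain-Python illustration with a hypothetical pad helper and made-up sequences, not the Keras pad_sequences implementation.

```python
def pad(sequences, padding="pre", pad_value=0):
    """Hypothetical helper: pad every sequence to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [pad_value] * (width - len(seq))
        # "pre" puts the zeros in front, anything else puts them at the end
        padded.append(fill + list(seq) if padding == "pre" else list(seq) + fill)
    return padded

# Made-up token sequences of different lengths.
sequences = [[5, 3], [7, 2, 9, 4], [8]]

pre = pad(sequences, padding="pre")
post = pad(sequences, padding="post")

shape_pre = (len(pre), len(pre[0]))
shape_post = (len(post), len(post[0]))
```

So a width that is off by one suggests that the longest sentence contains one token too many, which again points back to how the sentences were built.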

Can someone give me a hint on where to look?

Many thanks in advance!

Hello @RoWe84,

Can you share the output you got for the parse_data_from_file grader cell?

Also, for your fit_tokenizer grader cell, check that you called the fit on the sentences correctly.


Thanks for your fast reply!

I’ve added the output as pictures.

Honestly, I haven’t seen anything wrong.

Please share an image of the fit_tokenizer output.


How did you call fit on the sentences in that line of code?

After creating a tokenizer instance with the OOV parameter, I basically called

{Correct Code: Removed by Moderator}

How did you instantiate the Tokenizer class?

Also check in your parse data code whether you applied remove_stopwords to both the labels (row[0]) and the sentences (row[1]) of each row.

The Tokenizer class was instantiated this way:

from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(oov_token="<OOV>")  # the <OOV> is in my code but was not displayed here

I’ve checked the parse data function. I called remove_stopwords only for the sentences and not for the labels. Just for testing, I added the function call to the labels, with no effect on the vocab size or the shape.

It is hard to answer without posting lines of code. Sorry for that!

Can you share an image of the parse data grader cell code, as well as the first grader code cell, via personal DM? Click on my name and then Message.

I am fairly certain the issue is in these cells.

Also, the instructions for instantiating the tokenizer tell you to use oov_token="<OOV>" (the OOV needs to be wrapped in < >).
I hope you have done the same.
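
To illustrate what the OOV token is for, here is a minimal sketch in plain Python. The word index and the to_sequence helper are hypothetical and only mimic the general idea: any word that is not in the vocabulary is mapped to the index of the OOV token instead of being dropped.

```python
# Hypothetical word index; "<OOV>" is assumed to hold index 1.
word_index = {"<OOV>": 1, "the": 2, "cat": 3, "sat": 4}

def to_sequence(text, word_index, oov_token="<OOV>"):
    """Map each word to its index, falling back to the OOV index for unknown words."""
    oov_id = word_index[oov_token]
    return [word_index.get(word, oov_id) for word in text.lower().split()]

# "dog" is not in the vocabulary, so it becomes the OOV index.
seq = to_sequence("The dog sat", word_index)
```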

Problem solved - you pointed me into the right direction. Many thanks!

The issue was in the parse data grader cell.

If I had read the instructions carefully, it would have saved me some time.

I had used row[1:] to build the sentences, which was incorrect.

After changing that, the rest was argument parsing.
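
For anyone who lands here with the same symptom, a short sketch of the difference between row[1] and row[1:] when reading a CSV file. The data below is made up (including the header row) and only stands in for the assignment file: row[1] gives the sentence as a string, while row[1:] gives a list wrapping it, which downstream code will handle differently.

```python
import csv
import io

# Made-up two-column data; the real assignment file is not reproduced here.
raw = "category,text\ntech,new phone released today\nsport,team wins final\n"

reader = csv.reader(io.StringIO(raw))
next(reader)  # skip the (assumed) header row
rows = list(reader)

labels = [row[0] for row in rows]             # first column
sentences_ok = [row[1] for row in rows]       # second column as a plain string
sentences_sliced = [row[1:] for row in rows]  # slice: a list containing the string
```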

Hello @RoWe84,

Mistakes and debugging are part of programming, and you caught the problem with the hints provided and corrected it.

Happy to help!!!

Keep learning!!!