C3W1-Assignment -> too many words in vocab and wrong shape

Hello everyone,
I got stuck on the first assignment. The parse_data_from_file function works as expected, but fit_tokenizer gave me this output:

Vocabulary contains 30888 words

<OOV> token included in vocabulary

Inside the function, the Tokenizer gets only the OOV parameter.
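For context, the fitting step essentially follows the standard Keras pattern sketched below (the function name and signature here are my own approximation, not the official grader code):

from tensorflow.keras.preprocessing.text import Tokenizer

def fit_tokenizer(sentences):
    # Only the OOV token is passed; the vocabulary size then depends purely
    # on the distinct words left in the sentences after stopword removal.
    tokenizer = Tokenizer(oov_token="<OOV>")
    tokenizer.fit_on_texts(sentences)
    return tokenizer

So if the vocabulary ends up larger than expected, the extra words most likely come from the sentences being fed in, not from the Tokenizer settings themselves.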

As a consequence, the next function, get_padded_sequences, gave me this output:

First padded sequence looks like this:

[ 0 0 0 … 87 87 8]

Numpy array of all sequences has shape: (2225, 2439)

This means there are 2225 sequences in total and each one has a size of 2439

So there is a mismatch both in the numbers of the sequence and in the size of the array (2439 instead of the expected 2438). Even if I try post padding, the numbers are not correct, and obviously the shape does not change.
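The padding step in this part of the assignment typically boils down to the standard Keras pattern below (a rough sketch with assumed names, not the exact grader code). The second dimension of the array is just the length of the longest tokenized sentence, so a single extra token somewhere upstream is enough to turn 2438 into 2439:

from tensorflow.keras.preprocessing.sequence import pad_sequences

def get_padded_sequences(tokenizer, sentences):
    # Turn each sentence into a list of word indices.
    sequences = tokenizer.texts_to_sequences(sentences)
    # Without an explicit maxlen, pad_sequences pads (pre by default)
    # to the length of the longest sequence.
    padded = pad_sequences(sequences)
    return padded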

Can someone give me a hint where to look?

Many thanks in advance!

Hello @RoWe84,

Can you share the output you got for the parse_data_from_file grader cell?

Also, for your fit_tokenizer grader cell, make sure you have called the fit on the sentences correctly.

Regards
DP

Thanks for your fast reply!

I’ve added the output as pictures.


Honestly, I haven't spotted anything wrong.

Can you share the fit_tokenizer output image, please?

[Image attachment: DLTF-C3W1-3]

How did you call the fit on the sentences in your code?

After creating a tokenizer instance with the OOV parameter, I basically called

{Correct Code: Removed by Moderator}

How did you instantiate the tokenizer class?

Also check, in your parse data code, how you have applied remove_stopwords with respect to the labels (row[0]) and the sentences (row[1]) for each row.

The tokenizer class was instantiated this way:

from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(oov_token="<OOV>")  # the <OOV> string gets swallowed by the forum formatting, but it is in my code

I've checked the parse data function. I called remove_stopwords only for the sentences and not for the labels. Just for testing, I added the function call for the labels as well, with no effect on either the vocab size or the shape.

I know it's hard to answer without me posting lines of code. Sorry for that!

Can you share an image of your parse data grader cell code, as well as the first grader code cell, via personal DM? Click on my name and then Message.

I am fairly certain the issue is in these cells.

Also, the instructions for instantiating the tokenizer tell you to use oov_token="<OOV>" (the OOV needs to be wrapped in < >, which the forum display strips).
I hope you have used the same.
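As a small illustration (separate from the assignment, with made-up example text), the string you pass as oov_token is stored verbatim in word_index and is what every unseen word gets mapped to, which is why the angle brackets matter:

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(["the cat sat on the mat"])

print(tokenizer.word_index["<OOV>"])              # the OOV token gets index 1
print(tokenizer.texts_to_sequences(["the dog"]))  # unseen "dog" is mapped to index 1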

Problem solved - you pointed me in the right direction. Many thanks!

The issue was in the parse data grader cell.

If I had read the instructions carefully, it would have saved me some time.

I had used row[1:] for building the sentences, which was incorrect.

After changing that, the rest was just argument passing.
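For anyone who runs into the same thing, the corrected loop looks roughly like the sketch below (it assumes the course's remove_stopwords helper, a header row in the CSV, and this return order; details may differ from the actual notebook):

import csv

def parse_data_from_file(filename):
    sentences = []
    labels = []
    with open(filename, "r") as csvfile:
        reader = csv.reader(csvfile, delimiter=",")
        next(reader)  # skip the header row
        for row in reader:
            labels.append(row[0])                       # label column as-is
            sentences.append(remove_stopwords(row[1]))  # row[1], not row[1:]
    return sentences, labels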

Hello @RoWe84,

Mistakes and debugging are part of programming, and you caught the issue with the hints provided and corrected it.

Happy to help!!!

Keep learning!!!

Regards
DP