Hello,
I am having some difficulty finishing this lab; this time I need help with Exercise 4.
Inside the function preprocess_dataset() I proceeded as follows:
[code removed per the code-sharing guidelines in the mentor note below]
(Mentor note: please do not share any part of the graded code, as doing so is against the community guidelines. You can always share a screenshot of the error you encountered. If a mentor wants to see your code, they will ask you to send it by DM.)
The error that I get is the following:
```
ValueError: Exception encountered when calling TextVectorization.call().

Failed to convert a NumPy array to a Tensor (Unsupported object type _MapDataset).

Arguments received by TextVectorization.call():
  • inputs=<_MapDataset element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>
```
Could someone point out to me which steps need correction? My guess is that step 2 is incorrect, but I do not know how to proceed. Perhaps less important, but why does the lab start with dataset = dataset = None, repeating the word dataset twice?
hi @danidrimbe
Daniel, your issue is that you have hard-coded the steps that create the preprocessed data. Points 2 and 3 were not needed as separate steps; instead, merge your steps 1, 2, and 3 into one. When you call the lambda over text and label, apply the respective function to each to complete the dataset step.
However, your error actually points to a different problem: it is not these lines, but the fact that your texts are not in tensor form. So I would like to see your fit_label_encoder code as well as your text code.
Please click on my name and send screenshots of the code you wrote in the previous graded function cells.
Also send a screenshot of train_val_dataset.
You have done too much editing and added code in places where you were not supposed to.
Please get a fresh copy and redo your assignment. This time, make sure to replace only the None placeholders between the ### START CODE HERE ### and ### END CODE HERE ### markers, and do not add any extra lines of code.
- Issue with train_val_dataset:
1a. train_size should not be a hard-coded numerical value; compute it from the data argument and TRAINING_SPLIT. Make sure to use the int function, since the instructions say the number of sentences used in training must be an integer (see the sketch at the end of this list of issues).
1b. Next, to slice out the texts only and the labels only, index data by the relevant column position, not by row position.
1c. When splitting the sentences and labels into train and validation splits, your code for the train splits of texts and labels is incorrect because you wrote it with explicit index positions, which was not required.
- Issue with fit_label_encoder:
You have added more lines of code than required, such as the def decode_labels function.
- Issue with preprocess_dataset:
This graded function needs only two lines of code written for dataset: one for the lambda function over text and label, and one for the batch size. The other lines you added are not required.
Remember, when you use the lambda function with text and label, apply their respective functions to each (you did this, but in separate steps). Also, you are supposed to build a single dataset holding text and label together, not a text_dataset and a label_dataset; those two arguments are incorrect.
The simplest way is to write the first dataset line by passing the lambda to the map function; inside the lambda over (text, label), call each one's respective function. Also make sure the tuples are placed correctly, as a missing trailing element in a tuple has caused errors here before. A generic sketch of the whole pattern follows below.
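To show the general shape without giving away the graded solution, here is a minimal sketch with toy data. Every name and value in it (sentences, labels, TRAINING_SPLIT, BATCH_SIZE, vectorizer, label_encoder) is made up for illustration and is not the assignment's code:

```python
import tensorflow as tf

# Toy data and names, not the assignment's variables or solution
sentences = ["great movie", "terrible plot", "loved it", "boring film"]
labels = ["positive", "negative", "positive", "negative"]
TRAINING_SPLIT = 0.75
BATCH_SIZE = 2

# 1a: derive train_size from the data and the split ratio, cast with int()
train_size = int(len(sentences) * TRAINING_SPLIT)

# 1c: slice into train/validation splits without hard-coded index numbers
train_texts, val_texts = sentences[:train_size], sentences[train_size:]
train_labels, val_labels = labels[:train_size], labels[train_size:]

# Fit a vectorizer on the texts and a label encoder on the labels;
# output_sequence_length pads every sentence so batching works
vectorizer = tf.keras.layers.TextVectorization(output_sequence_length=5)
vectorizer.adapt(train_texts)
label_encoder = tf.keras.layers.StringLookup(num_oov_indices=0)
label_encoder.adapt(labels)

# One dataset, one map with a lambda over (text, label), then batch:
# each element's text and label go through their respective layers
dataset = tf.data.Dataset.from_tensor_slices((train_texts, train_labels))
dataset = dataset.map(lambda text, label: (vectorizer(text), label_encoder(label)))
dataset = dataset.batch(BATCH_SIZE)
```

Note how the map call hands tensors to the layers one element at a time; calling the vectorizer directly on a _MapDataset object is what produced your ValueError.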
Regards
DP
Dear DP,
Thank you very much for your explanations and suggestions; my code now works. However, I have a few more questions, and I would be glad if you could clarify a few points.
- You mentioned (in 1c) that my code for the train split of texts and labels is incorrect. How should I correct it, or what is a better way to do it? Also, I passed the tests with the code as it is.
- Regarding Exercise 4, I used the dataset.map(lambda…) command. Where can I find the precise syntax for this? I do not remember seeing it in the videos.
- In the fit_vectorizer function we use tf.keras.layers.TextVectorization(), and in the fit_label_encoder function we use tf.keras.layers.StringLookup(). What is the difference between these two functions? Also, why is fit_label_encoder needed at all, given that the number of labels is generally small (which is not necessarily the case for the number of words in the text)? If we really do need to encode the labels, why don't we use tf.keras.layers.TextVectorization() again?
Thank you very much in advance for your answers.
Daniel
The 1c point asks you not to use the explicit index position number; if you remove it, you should be fine.
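For instance, with toy values (none of this is the assignment's data), the two slices select the same elements, so dropping the explicit 0 simply matches the style the notebook expects:

```python
texts = ["a", "b", "c", "d"]
train_size = 3

# Both slices return ["a", "b", "c"]; the notebook's style omits the 0
assert texts[0:train_size] == texts[:train_size]
```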
For dataset, your first statement is correct. Also, kindly remove that code from your post, as it is part of the graded code.
For your third question, my simple explanation is this: fit_vectorizer uses TextVectorization to adapt to the training sentences, while fit_label_encoder encodes the labels. The two functions address two different categories, text and labels. Even with a small number of labels, you still need to encode them. Remember the first line of train_val_dataset, where you converted the number of training sentences into an integer for train_size? Labels are encoded for a similar reason: the model trains on numerical data, so both categories, text and labels, must be represented numerically so each can be related to the other.
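If it helps, here is a small side-by-side with toy data (none of it from the graded code). TextVectorization tokenizes a sentence into many integer ids, while StringLookup maps one whole string to a single id, which is why the latter suits labels:

```python
import tensorflow as tf

# TextVectorization: splits each sentence into tokens and maps every
# token to an id from the vocabulary it adapted to
vectorizer = tf.keras.layers.TextVectorization()
vectorizer.adapt(["the weather is nice", "the food was bad"])
print(vectorizer(["the food is nice"]))  # shape (1, 4): one id per token

# StringLookup: no tokenization; each whole string (a label) becomes
# exactly one integer id
encoder = tf.keras.layers.StringLookup(num_oov_indices=0)
encoder.adapt(["weather", "food"])
print(encoder(["food"]))                 # shape (1,): one id per label
```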
Thank you for your answers. Regarding the 1c point, I still do not see why I should remove the index position number in train_texts = texts[0:train_size]; it seems fine as it is.
As for my third question, your explanation makes sense. However, in the ungraded lab "Training a binary classifier with the Sarcasm Dataset" (Lab 2 of Week 2), the labels are not encoded. More precisely, train_dataset_final is defined by tf.data.Dataset.from_tensor_slices((train_sequences, train_labels)), but only train_sequences goes through the vectorize_layer. Is the reason that the labels are already numbers there, while in the graded lab of Week 2 the labels are words?
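To make my question concrete, here is the contrast as I understand it (toy data and my own naming, not the labs' code):

```python
import tensorflow as tf

# Like the ungraded sarcasm lab: labels are already integers, so they
# can go into from_tensor_slices with no encoding step
numeric_labels = [0, 1, 1, 0]
ds = tf.data.Dataset.from_tensor_slices((["s1", "s2", "s3", "s4"], numeric_labels))

# Like the graded lab: labels are strings, so a StringLookup pass is
# needed before the model can train on them
string_labels = ["sport", "tech", "sport", "business"]
label_encoder = tf.keras.layers.StringLookup(num_oov_indices=0)
label_encoder.adapt(string_labels)
encoded_labels = label_encoder(string_labels)  # one integer id per label
```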