C3W2_Assignment graded function preprocess_dataset

Cormac_Garvey · October 25, 2024, 12:05pm

Hi Folks,
I cannot follow what is being asked for in # GRADED FUNCTION: preprocess_dataset. I note there is already a lot of discussion about this exercise, but I have not found answers there. I have done the course again and looked at all the ungraded labs again. Where would I find a video or document that better describes the following args and return type, and what their purpose is? I don't understand what "preprocess dataset" means.
I also cannot follow how tf.keras.layers.TextVectorization is sometimes called like a method, and at other times is passed as an argument to another method (but this may be my lack of understanding of method chaining in Python).
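On the second point, this is ordinary Python rather than method chaining: a Keras layer is an object whose class defines `__call__`, so the same object can be called like a function and also handed to another function as an argument. A minimal plain-Python sketch of that idea follows; `ToyVectorizer` and `preprocess` are hypothetical stand-ins invented for illustration and are not part of Keras or the assignment.

```python
class ToyVectorizer:
    """Hypothetical stand-in for a layer: an object that is also callable."""

    def __init__(self, vocabulary):
        # Index 0 is reserved for unknown words in this toy example.
        self.index = {word: i + 1 for i, word in enumerate(vocabulary)}

    def __call__(self, text):
        # Defining __call__ is what makes `vectorizer(text)` legal.
        return [self.index.get(word, 0) for word in text.lower().split()]


def preprocess(texts, vectorizer):
    # The vectorizer arrives as a plain argument and is called inside.
    return [vectorizer(text) for text in texts]


vectorizer = ToyVectorizer(["cat", "sat", "mat"])

direct = vectorizer("The cat sat")                        # called like a function
passed = preprocess(["cat mat", "dog sat"], vectorizer)   # passed as an argument

print(direct)
print(passed)
```

In the graded function the same pattern applies: the already configured text_vectorizer and label_encoder are passed in as arguments and then called on the data inside the function.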

Args:
    dataset (tf.data.Dataset): dataset to preprocess
    text_vectorizer (tf.keras.layers.TextVectorization): text vectorizer
    label_encoder (tf.keras.layers.StringLookup): label encoder

Returns:
    tf.data.Dataset: transformed dataset

I thought I should not be using .adapt() here because the type is a dataset (and not a list of strings). But I see comments about using this function.

Basically, where should I start if I had to learn everything associated with this function? I am not getting the answer from doing the W2 course or looking at the ungraded labs, and there are not enough examples on the Keras website. So I find myself on Stack Overflow again, or trying to use Copilot. And using Copilot kind of defeats the purpose of trying to learn the basics of machine learning. Thanks in advance!

Cormac_Garvey · October 25, 2024, 12:10pm

One more comment. I cannot follow the English in this hint:

You can apply the preprocessing to each pair or text and label by using the .map method

What does it mean? In particular, the highlighted section. Could this be a typo?

Deepti_Prasad · October 25, 2024, 12:35pm

Hi @Cormac_Garvey

You selected the incorrect category for the course specialisation. You selected the NLP specialisation, whereas your query is from the TensorFlow Developer Professional specialisation.

I have moved it to the right category.

Deepti_Prasad · October 25, 2024, 12:38pm

Does this comment help you understand how to write the code for preprocessing the data?

Let me know if it doesn't! @Cormac_Garvey

Cormac_Garvey · October 29, 2024, 10:26am

Thanks Deepti, I got past that section in the end. However, I get stuck a little bit later. Cell 39 expects a shape of (32, 120) for both training and validation batches. My shape is (32,)?! Up to this point, all tests have passed, but the cell 40 unit tests fail with:

Failed test case: Got wrong data type for the preprocessed texts.
Expected: int64
Got: object
Failed test case: Got wrong data type for the preprocessed labels.
Expected: int64
Got: object
Failed test case: Got wrong shape for the preprocessed texts. Make sure that MAX_LENGTH is set to 120 before submitting.
Expected: (32, 120)
Got: (32,)

Many thanks

Deepti_Prasad · October 29, 2024, 10:35am

Can you share a screenshot of your code by personal DM?

Click on my name and then Message.

In the train/val dataset step, where you divide the dataset, make sure you applied int to the len-based calculation when assigning to the datasets.

Deepti_Prasad · October 30, 2024, 1:52am

Hi @Cormac_Garvey

You seem to have edited parts of the code or added extra code in train_val_dataset.

In the step that splits the sentences and labels into train and val datasets, you do not need the two lines of code where you compute the total number of elements and assign it to train and val.

Splitting the texts and labels into train and val was enough (the last four lines of code are correct).

The next correction required is in preprocess_dataset, where you used the adapt function. It is not required; just use dataset.map. Also read the pinned comment, which explains that dataset.map is used with a lambda in which text_vectorizer is applied to the text and label_encoder is applied to the labels.

Refer to the pinned comment in this post to make the corrections.

Please make sure not to hard-code any of the paths, and always write code according to the instructions given in the assignment.

I would suggest getting a fresh copy and redoing the assignment from the beginning, making sure to only write code between the markers ###START AND END CODE### for a successful submission.
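A minimal sketch of the pattern described above, on toy data rather than the assignment's dataset. The sample texts, the labels, `output_sequence_length=5`, `num_oov_indices=0` and the batch size of 2 are all illustrative assumptions. It also shows why unmapped data produces the "object" dtype and the (32,)-style shape reported earlier in the thread: a batch of raw strings is a one-dimensional array of bytes objects.

```python
import tensorflow as tf

# Toy data (assumption: not the assignment's BBC dataset).
texts = tf.constant(["the cat sat", "stocks fell sharply today", "the team won"])
labels = tf.constant(["sport", "business", "sport"])

# The layers are adapted once, up front, so the mapping step only applies them.
text_vectorizer = tf.keras.layers.TextVectorization(output_sequence_length=5)
text_vectorizer.adapt(texts)

label_encoder = tf.keras.layers.StringLookup(num_oov_indices=0)
label_encoder.adapt(labels)

raw = tf.data.Dataset.from_tensor_slices((texts, labels))

# Without map: each batch still holds raw strings.
raw_texts, raw_labels = next(raw.batch(2).as_numpy_iterator())

# With map: one lambda receives each (text, label) pair and transforms both.
processed = raw.map(
    lambda text, label: (text_vectorizer(text), label_encoder(label))
).batch(2)
proc_texts, proc_labels = next(processed.as_numpy_iterator())

print(raw_texts.shape, raw_texts.dtype)    # raw strings
print(proc_texts.shape, proc_texts.dtype)  # padded integer sequences
```

Note that .adapt() appears only where the layers are built; by the time the dataset is mapped, the vocabulary is already fixed.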

Let me know if you need more help.

Regards
DP

Cormac_Garvey · October 31, 2024, 10:20am

Thanks Deepti. What do you mean by "pinned comment where it tells dataset.map" etc.? I cannot follow this. What is the name of the comment, so I can search for it? Many thanks.

Deepti_Prasad · October 31, 2024, 10:34am

"Pinned comment" means the comment link I shared here in my previous comment. Anyway, I have replied in your DM.
A small correction: don't call lambda for text and labels separately; it is a single line of code.

Deepti_Prasad · October 31, 2024, 10:39am

@Cormac_Garvey

You asked me about the function, so I am just sharing a link. Go through it; it is not directly related to your query, but it will help you understand lambda functions.

Regards
DP

Deepti_Prasad · October 31, 2024, 11:39am

The reason the dataset code should be written in one line is that when you separate the two steps for text and labels, you create more datasets than required, which can throw an IOPub rate limit error when you train your model @Cormac_Garvey

mayuriroy · December 20, 2024, 11:40am

This C3W2 is incredibly confusing. There are a total of 6 graded functions, out of which the last 3 just won't work. I need some sort of explanation here, because the exact same code works in the ungraded lab.

I have been struggling with the lab assignment for over a week now, and I finally give up. It makes absolutely no sense to continue struggling like this.

Deepti_Prasad · December 20, 2024, 9:05pm

Hi @mayuriroy

Please always create a new topic for your issue rather than commenting on older threads, so your learning journey is saved in your log.

Also, I checked your other post. You have probably hard-coded the dataset, resulting in failed grading. You have also failed create model, so check that you haven't used global variables instead of local variables.

I think another mentor has told you to DM your code; he will surely guide you further.

e_d_k · March 8, 2025, 5:48pm

Hi there,
I think I am having a similar issue. I have tried to map and use the batch mechanism at the end. The dimensionality check seems to match the expected output, but later on it fails.

My code is:
dataset = dataset.map(lambda text_vectorizer, label_encoder:text_vectorizer)
dataset = (dataset
.batch(32))

Failure reason:

AttributeError                            Traceback (most recent call last)
Cell In[67], line 4
      1 train_batch = next(train_proc_dataset.as_numpy_iterator())
      2 validation_batch = next(validation_proc_dataset.as_numpy_iterator())
----> 4 print(f"Shape of the train batch: {train_batch[0].shape}")
      5 print(f"Shape of the validation batch: {validation_batch[0].shape}")

AttributeError: 'bytes' object has no attribute 'shape'
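One likely reading of this error, sketched in plain Python: the lambda's parameters are named text_vectorizer and label_encoder, so inside the lambda those names refer to the incoming text and label, not to the outer layers. The lambda therefore returns the raw text untouched and drops the label, which leaves each batch as a flat array of bytes strings, so `train_batch[0]` is a single bytes object. Below, `itertools.starmap` stands in for dataset.map, and the two functions are hypothetical toy replacements for the Keras layers.

```python
from itertools import starmap


def text_vectorizer(text):
    # Toy stand-in (assumption): encode each word as its length.
    return [len(word) for word in text.split()]


def label_encoder(label):
    # Toy stand-in (assumption): fixed label-to-integer table.
    return {"sport": 0, "business": 1}[label]


pairs = [("the team won", "sport"), ("stocks fell", "business")]

# Buggy: the parameter names shadow the outer functions, so nothing is
# vectorized and the label is discarded.
buggy = list(starmap(lambda text_vectorizer, label_encoder: text_vectorizer, pairs))

# Fixed: the parameters name the data, and the outer functions are called on them.
fixed = list(
    starmap(lambda text, label: (text_vectorizer(text), label_encoder(label)), pairs)
)

print(buggy)
print(fixed)
```

The buggy version returns the original strings, while the fixed version returns a (vectorized text, encoded label) pair for every element.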

Kenneth_Brezinsky · March 26, 2025, 3:14am

I am stuck exactly at the same place with the same error. I don’t know what to do next. I have read the other posts. Help!

Deepti_Prasad · March 26, 2025, 9:43am

Please create a new topic with a proper description of your issue.

I am closing this thread to avoid confusion.

Kenneth_Brezinsky · March 26, 2025, 3:50pm

Deepti

Thank you for answering. I finally figured it out after all.

Ken
