C3W2_Assignment graded function preprocess_dataset

Hi Folks,
I cannot follow what is being asked for in # GRADED FUNCTION: preprocess_dataset. I note there is a lot of discussion about this exercise already, but I have not found answers there. I have done the course again and looked at all the ungraded labs again. Where would I find a video or document that better describes the following args and return type, and what their purpose is? I don't understand what "preprocess dataset" means.
I also cannot follow how tf.keras.layers.TextVectorization is sometimes called like a method, and at other times is passed as an argument to another method (but this may be my lack of understanding of method chaining in Python).

Args:
    dataset (tf.data.Dataset): dataset to preprocess
    text_vectorizer (tf.keras.layers.TextVectorization): text vectorizer
    label_encoder (tf.keras.layers.StringLookup): label encoder

Returns:
    tf.data.Dataset: transformed dataset

I thought I should not be using .adapt() here because the type is a dataset (and not a list of strings). But I see comments about using this function.
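A note on the question of a layer being "called like a method" and also passed as an argument: this looks like ordinary Python rather than method chaining. Any object whose class defines `__call__` can be invoked with parentheses, and Keras layers are objects of that kind, so the same object can also be handed to another function. A minimal sketch, where `Scaler` and `apply_to_all` are hypothetical names invented for illustration and are not part of TensorFlow:

```python
class Scaler:
    """Hypothetical stand-in for a layer: an object that holds state and can be called."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, value):
        # Runs when the object is used like a function: scaler(value)
        return value * self.factor


def apply_to_all(values, transform):
    """Hypothetical helper that receives a callable object as an argument."""
    return [transform(v) for v in values]


scaler = Scaler(2)

direct = scaler(5)                          # the object is called directly
passed = apply_to_all([1, 2, 3], scaler)    # the same object is passed as an argument

print(direct)
print(passed)
```

In the assignment's terms, text_vectorizer plays the role of `scaler`: it arrives in preprocess_dataset as an argument and is then called on the text.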

Basically, where should I start if I had to learn everything associated with this function? I am not getting the answer from doing the W2 course or looking at the ungraded labs, and there are not enough examples on the Keras website. So I find myself on Stack Overflow again, or trying to use Copilot. And using Copilot kind of defeats the purpose of trying to learn the basics of machine learning. Thanks in advance!

One more comment. I cannot follow the English in this hint:

You can apply the preprocessing to each pair or text and label by using the .map method

What does it mean, in particular the highlighted section? Could this be a typo?

hi @Cormac_Garvey

You selected the incorrect category for the course specialisation. You selected the NLP specialisation, whereas your query is from the TensorFlow Developer Professional specialisation.

I have moved it to the right category.

Does this comment help you understand how to write the code for preprocessing the data?

Let me know if it doesn’t! @Cormac_Garvey

Thanks Deepti, I got past that section in the end. However, I get stuck a little bit later. Cell 39 expects a shape of (32, 120) for both training and validation batches. My shape is (32,)! Up to this point all tests have passed, but the cell 40 unit tests fail with:

Failed test case: Got wrong data type for the preprocessed texts.
Expected: int64
Got: object
Failed test case: Got wrong data type for the preprocessed labels.
Expected: int64
Got: object
Failed test case: Got wrong shape for the preprocessed texts. Make sure that MAX_LENGTH is set to 120 before submitting.
Expected: (32, 120)
Got: (32,)
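For reference, a shape of (32,) with dtype object is what a batch of raw, unvectorized strings looks like once converted to NumPy, which suggests the vectorizer was never applied to the batches. A small sketch under assumptions: the sentences are hypothetical stand-ins for the assignment data, and the vectorizer settings are guesses apart from MAX_LENGTH = 120, which comes from the test message above.

```python
import tensorflow as tf

MAX_LENGTH = 120  # value named in the failed test message
BATCH_SIZE = 32   # batch size implied by the expected shape (32, 120)

# Hypothetical stand-in sentences; the assignment's data is not reproduced here.
sentences = tf.constant([f"sample sentence number {i}" for i in range(BATCH_SIZE)])

# Assumed configuration: pad or truncate every sequence to MAX_LENGTH tokens.
vectorizer = tf.keras.layers.TextVectorization(output_sequence_length=MAX_LENGTH)
vectorizer.adapt(sentences)

# A batch taken straight from the dataset still contains strings.
raw_batch = next(iter(tf.data.Dataset.from_tensor_slices(sentences).batch(BATCH_SIZE)))
raw_array = raw_batch.numpy()
print(raw_array.shape, raw_array.dtype)

# The same batch after the vectorizer has been applied to it.
vectorized = vectorizer(raw_batch).numpy()
print(vectorized.shape, vectorized.dtype)
```

If the first pair of values matches what the unit test reports, the mapping step that applies the vectorizer and label encoder is the place to look.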

Many thanks :pray:

Can you share a screenshot of your code by personal DM?

Click on my name and then Message.

In train_val_dataset, when you divide the dataset, make sure you wrap the size calculation that uses the len function in int() before you use it to split the data.
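A rough illustration of the int point, using plain Python lists. The names `TRAINING_SPLIT`, `sentences` and `labels`, and the 0.8 fraction, are assumptions made for this sketch and are not taken from the assignment:

```python
TRAINING_SPLIT = 0.8  # hypothetical fraction of the data used for training

# Hypothetical toy data standing in for the real sentences and labels.
sentences = [f"sentence {i}" for i in range(10)]
labels = [f"label {i % 2}" for i in range(10)]

# len(...) * fraction is a float; slicing needs an integer, hence int(...).
train_size = int(len(sentences) * TRAINING_SPLIT)

train_sentences = sentences[:train_size]
train_labels = labels[:train_size]
val_sentences = sentences[train_size:]
val_labels = labels[train_size:]

print(train_size, len(train_sentences), len(val_sentences))
```

Without the int() call the slice index would be a float, and Python raises a TypeError for float slice indices.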

hi @Cormac_Garvey

You seem to have edited parts of the code or added extra code in train_val_dataset.

In the step that splits the sentences and labels into train and val datasets, you do not need the two code lines where you recall the total number of elements and assign it to train and val.

Splitting the texts and labels into train and val was enough (the last four code lines are correct).

The next correction required is in preprocess_dataset, where you used the adapt function, which is not required; just use dataset.map. Also read the pinned comment, which explains that dataset.map is used with a lambda in which text_vectorizer is applied to the text and label_encoder to the labels.
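To illustrate the general pattern on toy data (this is a sketch, not the assignment solution): every element of the dataset is a (text, label) pair, and .map calls one function on each pair. The sentences, labels and layer settings below are hypothetical, and both layers are adapted once, up front, outside the mapping step.

```python
import tensorflow as tf

# Hypothetical toy data, not the assignment's dataset.
texts = tf.constant(["the cat sat", "stocks fell sharply", "the team won"])
labels = tf.constant(["pets", "business", "sport"])

# Assumed settings for illustration; the layers are adapted before mapping.
text_vectorizer = tf.keras.layers.TextVectorization(output_sequence_length=5)
text_vectorizer.adapt(texts)
label_encoder = tf.keras.layers.StringLookup()
label_encoder.adapt(labels)

# Each element of this dataset is a (text, label) pair of string tensors.
dataset = tf.data.Dataset.from_tensor_slices((texts, labels)).batch(3)

# One lambda receives both halves of the pair and returns the transformed pair.
encoded = dataset.map(
    lambda text, label: (text_vectorizer(text), label_encoder(label))
)

text_batch, label_batch = next(iter(encoded))
print(text_batch.shape, text_batch.dtype)
print(label_batch.shape, label_batch.dtype)
```

Note that adapt does not appear inside the mapping; the layers only transform data there.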

Refer to the pinned comment in this post to make the corrections.

Please make sure not to hard code any of the paths, and always write code according to the instructions given in the assignment.

I would suggest getting a fresh copy and redoing the assignment from the beginning, making sure to write code only between the ###START AND END CODE### markers for successful submission.

Let me know if you need more help.

Regards
DP

Thanks Deepti. What does "pinned comment where it tell dataset.map" etc. mean? I cannot follow this. What is the name of the comment, so I can search for it? Many thanks.

Pinned comment means the comment link I shared here in my previous comment. Anyway, I have replied in your DM.
Mild correction: don’t call the lambda for text and labels separately; it is a single code line.

@Cormac_Garvey

You asked me about the function, so I am just sharing a link. Go through it; it is not directly related to your query, but it will help you understand lambda functions.

Regards
DP

The reason the dataset code should be written in one line is that when you separate the two steps for text and labels, you create more datasets than required, which can throw an IOPub rate limit error when you train your model @Cormac_Garvey

This C3W2 is incredibly confusing. There are a total of 6 graded functions, of which the last 3 just won't work. I need some sort of explanation here, because the exact same code works in the ungraded lab.

I have been struggling with the lab assignment for over a week now, and I finally give up. It makes absolutely no sense to continue struggling like this.

hi @mayuriroy

Please always create a new topic for your issue rather than commenting on older threads, so your learning journey is saved in your log.

Also, I checked your other post. You have probably hard coded the dataset, resulting in failed grading. Your create model function has also failed, so check that you haven’t used global variables instead of local variables.

I think another mentor has told you to DM your code; he will surely guide you further.