# GRADED FUNCTION: preprocess_dataset

Thanks for your previous help, but it seems that the C3W2 assignment is my unlucky one, as I managed to finish everything else in C3.

Now I am stuck on # GRADED FUNCTION: preprocess_dataset
(everything before it works :blush:)

For testing this GRADED FUNCTION the following code is presented:

The train_dataset argument is at that point a _TensorSliceDataset containing the text and the label.

When mapping text_vectorizer over this dataset, I have two inputs per element, but the vectorizer expects only one. Do I have to split the _TensorSliceDataset again? If so, how?

Conversely, the function label_encoder (last argument in the function preprocess_dataset) takes in two arguments:

Both of these arguments (train_labels, validation_labels) are labels, one from the training set and one from the validation set. But when calling the preprocess_dataset function, only one dataset is given:

This train_dataset contains only train text and train labels, so the function fit_label_encoder can only receive train labels – i.e. the preprocess_dataset function is lacking data for the validation_labels (it would also make no sense to have validation labels together with train labels…).

Could you please have a look at the function and give advice? Many Thanks!

Here are a couple of hints:

  1. preprocess_dataset is provided text_vectorizer and label_encoder adapted to the correct split(s) of the dataset.
  2. Each entry in train_dataset is a tuple with structure (text, label).
  3. To access a field of the tuple in each row, see this example: text_only_dataset = train_dataset.map(lambda text, label: text). Use this information to apply the correct transformations and return a tuple of encoded text and encoded label.
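
To illustrate hint 3, here is a minimal sketch on toy data. The sample texts and labels are invented, and tf.strings.lower / tf.strings.length are only placeholder transforms (assumptions for illustration, not the assignment's vectorizer or encoder). It shows how a lambda with two arguments can pick one field or return a new tuple:

```python
import tensorflow as tf

# Toy stand-in for a (text, label) dataset; the values are invented.
texts = ["Stocks rally on earnings", "Team wins the cup"]
labels = ["business", "sport"]
dataset = tf.data.Dataset.from_tensor_slices((texts, labels))

# Each element is a (text, label) tuple, so the lambda takes two arguments.
text_only = dataset.map(lambda text, label: text)

# Returning a tuple keeps both fields while transforming each one separately.
# tf.strings.lower and tf.strings.length are placeholder transforms only.
transformed = dataset.map(
    lambda text, label: (tf.strings.lower(text), tf.strings.length(label))
)

first_text = next(iter(text_only)).numpy()
first_pair = next(iter(transformed))
print(first_text)
print(first_pair[0].numpy(), first_pair[1].numpy())
```

The same pattern applies with any callable per field, as long as the lambda returns one tuple with both results.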

In addition to what the other mentor mentioned, refer to the ungraded labs, which will help you understand how train_labels and validation_labels are used in fit_label_encoder, and why train_proc_dataset only uses train_dataset.

Basically, in the graded function preprocess_dataset you create a dataset of vectorized texts and encoded labels, using the text vectorizer and the label encoder fitted in the earlier cells. These functions normalize the data for better performance.

In fit_label_encoder, you concatenate the train labels with the validation labels, then create the encoder with [tf.keras.layers.StringLookup], making sure not to include OOV tokens, as the instructions in the graded cell mention.
Finally, you fit the encoder to all the labels.
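
As a rough sketch of that description (not the graded solution), the steps might look like this on toy data. The label values are invented, and setting num_oov_indices=0 is my reading of "not include OOV tokens":

```python
import tensorflow as tf

# Invented label arrays standing in for train_labels / validation_labels.
train_labels = tf.constant(["sport", "business", "tech", "sport"])
validation_labels = tf.constant(["tech", "politics"])

# Fit on both splits so that every class gets an index.
all_labels = tf.concat([train_labels, validation_labels], axis=0)

# num_oov_indices=0 drops the out-of-vocabulary slot, so indices start at 0.
label_encoder = tf.keras.layers.StringLookup(num_oov_indices=0)
label_encoder.adapt(all_labels)

vocab = label_encoder.get_vocabulary()
encoded = label_encoder(train_labels)
print(vocab)
print(encoded.numpy(), encoded.dtype)
```

Because the encoder also sees the validation labels, a class that appears only in the validation split ("politics" here) still gets an index.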

For preprocess_dataset, follow the steps the other mentor mentioned. There is also a test cell before the graded preprocess_dataset cell that could guide you, but it is not a direct hint: in preprocess_dataset you need to handle the text and the label together, as a (text, label) pair.

All the best!!!

Hope it resolves your issue.

Regards
DP

Thanks again, it worked. My error was not realizing that the label_encoder had already been instantiated earlier…

so the preprocess_dataset function does not take in the function fit_label_encoder but the encoder instance it returns… :sleeping:

You caught the hint!! :crazy_face:

:partying_face: :tada:

I am struggling too. Do we need to use vectorizer.adapt() here?

Hi, I’m also struggling with this section of the assignment. I have tried many variations of the commands I think are involved. This is the closest I have come. It gives the correct outcome for the immediately subsequent cell but then gives the wrong shape in the next cell.

Posting graded cell code is against the community guidelines. Kindly refrain from posting code; refer to the FAQ section Code of Conduct.

Please could someone help explain where I am going wrong?

Thanks,
Madie

You are not supposed to post code in a public thread. Whenever you encounter an issue, kindly create a new post with a screenshot of the error, without sharing any graded cell code.

For a better understanding, refer to the FAQ section Code of Conduct.

Hi,
Apologies, this is the first time I have tried to do this and did not realise.

Unfortunately, I don't seem to be able to upload images either; I keep getting an error.
The issue I am having is that the shape of the batches is coming out as (32,) instead of (32,120).

Thanks,
Madie

Can you take a screenshot of the error you are mentioning, @mallen, so I get a better understanding of where your code might be going wrong?

If it's a lengthy error log, you can take two separate screenshots and post them. Also confirm whether the unittest for your previous graded cell, fit_label_encoder, passed. If not, share a screenshot of the output you got when you ran that unittest cell.

Regards
DP

Hi,

Unfortunately, I cannot seem to upload an image. I keep getting errors and have tried multiple browsers.

These are the errors I get at the unittests. All previous cells have passed all unittest cells successfully.

Failed test case: Got wrong data type for the preprocessed texts.
Expected: int64
Got: object

Failed test case: Got wrong data type for the preprocessed labels.
Expected: int64
Got: object

Failed test case: Got wrong shape for the preprocessed texts. Make sure that MAX_LENGTH is set to 120 before submitting.
Expected: (32, 120)
Got: (32,)

Thanks,
Madie
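
For anyone reading these messages later: as far as I can tell, a dtype of object together with a shape of (32,) is what a batch of raw, untransformed strings looks like. A small sketch on toy data (the texts, the vocabulary and the batch size of 3 are all invented for illustration) compares a raw batch with a vectorized one:

```python
import tensorflow as tf

MAX_LENGTH = 120
texts = ["stocks rally on earnings", "team wins the cup", "new phone released"]
dataset = tf.data.Dataset.from_tensor_slices(texts)

# The vocabulary is supplied directly here only to keep the sketch short.
vectorizer = tf.keras.layers.TextVectorization(
    output_sequence_length=MAX_LENGTH,
    vocabulary=["stocks", "team", "phone", "cup"],
)

# Raw strings: one string per row, and NumPy reports tf.string as object.
raw_batch = next(iter(dataset.batch(3)))

# Vectorized: each string becomes MAX_LENGTH integer token ids.
vec_batch = next(iter(dataset.map(lambda text: vectorizer(text)).batch(3)))

print(raw_batch.shape, raw_batch.numpy().dtype)
print(vec_batch.shape, vec_batch.numpy().dtype)
```

So if a test reports object and (32,), it is worth checking whether the dataset that is returned is really the one the transformations were applied to.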

Did you pass the fit_label_encoder unittest?

Yes :slight_smile:

Hi @mallen

As you mentioned that you have passed the previous unittest cell, I am only sharing guidance for the graded cell you are currently having an issue with. If it throws an error again after the correction, then the code in the previous graded cell needs to be looked at; let me know if that happens.

For now, please refer to the comment below. Remember that the max length is set in the vectorizer code of the previous graded cell, which is why I asked whether your previous unittest cell passed. Setting the max length is not part of preprocess_dataset.

Regards
DP

Share the new error you got; as suspected, there is an issue with your previous graded cell.

To take a screenshot, use the PrtScn key on a non-Mac machine; on a Mac, press Shift + Command + 3.

@mallen

You can paste the error here rather than sending a DM.

According to your most recent message, you have used vectorizer.adapt in the dataset code, which is incorrect.

vectorizer.adapt is supposed to be used in the previous graded cell.

The "not defined" error appears because you are referring to a name incorrectly: it was defined as labels, not label.

That being said, your code below is still incorrect.

dataset = dataset.map(lambda text, label : text_vectorizer.adapt(text), label_encoder.adapt(label))
#text_vectorizer.adapt(dataset.map(lambda text, label: text))
#label_encoder.adapt(dataset.map(lambda text, label: label))

You only need to use lambda, then pass text to its corresponding function, text_vectorizer, and likewise pass the label to label_encoder.

If this throws an IOPub rate limit error, that means your previous graded cells are incorrect.

You can send me that code cell by personal DM, and also send how you corrected the preprocess_dataset code.

The reason I ask for a screenshot of the error or code is not for my own benefit: minor syntax errors get missed with copy and paste, and learners end up having to find the issue themselves. This happened when I was addressing an issue for another learner, who had missed a ) in his code.
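
To make the difference between adapt and calling a layer concrete, here is a small sketch on toy data (the texts, labels and output_sequence_length=6 are invented, and this is not the assignment code). It assumes the usual Keras behaviour that adapt only updates the layer's vocabulary and returns nothing, while calling the fitted layer is what produces the encoded values:

```python
import tensorflow as tf

# Invented toy data standing in for the real dataset.
texts = ["stocks rally on earnings", "team wins the cup"]
labels = ["business", "sport"]
dataset = tf.data.Dataset.from_tensor_slices((texts, labels))

text_vectorizer = tf.keras.layers.TextVectorization(output_sequence_length=6)
label_encoder = tf.keras.layers.StringLookup(num_oov_indices=0)

# Fitting happens once, up front. adapt learns the vocabulary and returns
# None, so its result cannot serve as the output of a map function.
text_only = dataset.map(lambda text, label: text).batch(2)
adapt_result = text_vectorizer.adapt(text_only)
label_encoder.adapt(tf.constant(labels))

# Calling the fitted layers is what actually transforms the data.
text_ids = text_vectorizer(tf.constant(texts))
label_ids = label_encoder(tf.constant(labels))

print(adapt_result)
print(text_ids.shape, text_ids.dtype)
print(label_ids.numpy(), label_ids.dtype)
```

Inside a map function the fitted layers are called in the same way, once per field.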

@mallen

Your fit_label_encoder code is correct.

Now for the dataset,
see the image.

dataset given

Here the labels and text were written separately in two lines of dataset code, but you need to write them in one line.

Another difference in this image is that they have not used the fitted encoder returned by fit_label_encoder, which you need to use when writing the dataset code.

The lambda first handles text with its corresponding function, i.e. text_vectorizer, and then labels with its corresponding function, i.e. label_encoder. You had used label instead of labels.

This is a direct hint; after this I would have to give you the written code directly :crazy_face:

Regards
DP