Question about UNQ_C7

Bingqi_Lian · August 2, 2022, 2:29pm

UNQ_C7 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)

GRADED_FUNCTION: preprocess_data

def preprocess_data(train_data, test_data, count_threshold):
    """
    Preprocess data, i.e.,
        - Find tokens that appear at least N times in the training data.
        - Replace tokens that appear less than N times by "<unk>" both for training and test data.
    Args:
        train_data, test_data: List of lists of strings.
        count_threshold: Words whose count is less than this are
                         treated as unknown.

    Returns:
        Tuple of
        - training data with low frequent words replaced by "<unk>"
        - test data with low frequent words replaced by "<unk>"
        - vocabulary of words that appear n times or more in the training data
    """
    ### START CODE HERE (Replace instances of 'None' with your code) ###

    # Get the closed vocabulary using the train data
    vocabulary = get_words_with_nplus_frequency(train_data, count_threshold=count_threshold)

    # For the train data, replace less common words with "<unk>"
    train_data_replaced = replace_oov_words_by_unk(train_data, vocabulary, unknown_token="<unk>")

    # For the test data, replace less common words with "<unk>"
    test_data_replaced = replace_oov_words_by_unk(test_data, vocabulary, unknown_token="<unk>")

    ### END CODE HERE ###
    return train_data_replaced, test_data_replaced, vocabulary

Can anyone help me find where I made a mistake? My replace_oov_words_by_unk function was tested and passed.

reinoudbosch · August 2, 2022, 10:21pm

Hi Bingqi_Lian,

You may not want to hardcode unknown_token in replace_oov_words_by_unk.

Bingqi_Lian · August 3, 2022, 6:53am

Yes, I tried. But it still shows me an error.

### START CODE HERE ###

# Get the closed vocabulary using the train data
vocabulary = get_words_with_nplus_frequency(train_data, count_threshold=count_threshold)

# For the train data, replace less common words with "<unk>"
train_data_replaced = replace_oov_words_by_unk(train_data, vocabulary)

# For the test data, replace less common words with "<unk>"
test_data_replaced = replace_oov_words_by_unk(test_data, vocabulary)

### END CODE HERE ###
return train_data_replaced, test_data_replaced, vocabulary

reinoudbosch · August 3, 2022, 1:51pm

In that case there’s a problem somewhere else. If you cannot find it, feel free to send me your notebook as an attachment to a direct mail so I can have a look.

Bingqi_Lian · August 3, 2022, 2:55pm

I just sent my notebook to you. Thank you so much!

reinoudbosch · August 3, 2022, 3:55pm

Hi Bingqi_Lian,

You have to include unknown_token as a parameter in replace_oov_words_by_unk; just don’t hardcode it!
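
For anyone landing here later, here is a minimal sketch of what "pass it as a parameter, don't hardcode it" could look like. It is not the assignment's reference solution: the two helper bodies are simplified stand-ins written for this illustration, and the extra unknown_token argument on preprocess_data is an assumption (the signature posted above only has three arguments). The point is only that the token is forwarded as a variable instead of being written out as the literal "<unk>" in each call.

```python
from collections import Counter


def get_words_with_nplus_frequency(sentences, count_threshold):
    # Stand-in helper (assumed behaviour): keep words seen at least
    # count_threshold times across all tokenized sentences.
    counts = Counter(token for sentence in sentences for token in sentence)
    return [word for word, count in counts.items() if count >= count_threshold]


def replace_oov_words_by_unk(sentences, vocabulary, unknown_token="<unk>"):
    # Stand-in helper (assumed behaviour): swap every out-of-vocabulary
    # token for whatever unknown_token the caller passed in.
    vocab = set(vocabulary)
    return [
        [token if token in vocab else unknown_token for token in sentence]
        for sentence in sentences
    ]


def preprocess_data(train_data, test_data, count_threshold, unknown_token="<unk>"):
    # unknown_token as a fourth argument is an assumption for this sketch.
    vocabulary = get_words_with_nplus_frequency(train_data, count_threshold)
    # Forward the variable rather than the literal string "<unk>".
    train_data_replaced = replace_oov_words_by_unk(
        train_data, vocabulary, unknown_token=unknown_token
    )
    test_data_replaced = replace_oov_words_by_unk(
        test_data, vocabulary, unknown_token=unknown_token
    )
    return train_data_replaced, test_data_replaced, vocabulary


train = [["sky", "is", "blue", "."], ["leaves", "are", "green"]]
test = [["roses", "are", "red", "."]]
train_out, test_out, vocab = preprocess_data(train, test, 1, unknown_token="<UNK>")
print(test_out)
```

With a hardcoded "<unk>" in the two calls, a test that passes a different token (like "<UNK>" here) would still get "<unk>" back, which is one way the unit tests can fail even though replace_oov_words_by_unk itself is correct.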
