Hello everyone,
I’m facing an issue with a Python function that processes text data, and I’m hoping to get some help from the community. The function `process_data` is supposed to read a text file, remove punctuation, convert the text to lowercase, and split it into individual words.
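For context, here is roughly what the function looks like (simplified; the exact variable names differ, but the steps are the same):

```python
import re

def process_data(path):
    # Read the raw text
    with open(path, 'r', encoding='utf-8') as f:
        text = f.read()
    # Remove anything that is not a word character or whitespace
    text = re.sub(r'[^\w\s]', '', text)
    # Lowercase and split on whitespace
    return text.lower().split()
```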
When I process shakespeare.txt with this function, I get the following results:
- First 10 words in the text: `['o', 'for', 'a', 'muse', 'of', 'fire', 'that', 'would', 'ascend', 'the']` (this matches expectations).
- Number of unique words in the vocabulary: 6457.

However, the expected vocabulary count is 6116.
What I’ve tried:
- Double-checked the regex `r'[^\w\s]'` to ensure it removes all punctuation correctly (a diagnostic along these lines is sketched below).
- Ensured all words are converted to lowercase before splitting.
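Here is the quick diagnostic mentioned above. It looks for tokens the regex leaves behind (`\w` matches digits and underscores, so those survive) and for unusually long tokens, which can come from hyphenated words being fused when punctuation is deleted rather than replaced:

```python
import re

# Assumes the process_data sketch from earlier is in scope.
words = process_data('shakespeare.txt')
vocab = set(words)
print(len(vocab))  # prints 6457 for me

# \w matches digits and underscores, so r'[^\w\s]' leaves them in;
# any such tokens are likely artifacts rather than real words.
print(sorted(w for w in vocab if re.search(r'[\d_]', w))[:20])

# Deleting punctuation outright fuses hyphenated words
# ('self-love' -> 'selflove'); fused tokens tend to be long and rare.
print(sorted(vocab, key=len, reverse=True)[:20])
```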
Questions:
- Could the discrepancy be due to leftover artifacts in the text file like extra spaces or special characters?
- Is there a better way to normalize the text data to align the word count with expectations? (One alternative I’ve been considering is sketched after this list.)
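The alternative I’ve been considering is to replace punctuation with a space instead of deleting it, so hyphenated words split into their parts rather than fusing. Whether this lands on the expected 6116 presumably depends on how the reference count was produced, and the name `process_data_spaces` is just for illustration:

```python
import re

def process_data_spaces(path):
    with open(path, 'r', encoding='utf-8') as f:
        text = f.read().lower()
    # Replace every non-letter character with a space instead of
    # deleting it: 'self-love' becomes 'self love', not 'selflove'.
    # This also drops digits and underscores, unlike [^\w\s].
    text = re.sub(r'[^a-z\s]', ' ', text)
    return text.split()
```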
Any advice or insights would be greatly appreciated!
Thanks in advance for your help!