C2_W1 Assignment 1: Autocorrect - Exercise 1

Hello everyone,

I’m facing an issue with a Python function that processes text data, and I’m hoping to get some help from the community. The function process_data is supposed to read a text file, clean it by removing punctuation, converting it to lowercase, and splitting it into individual words.

When I process shakespeare.txt with this function, I get the following results:

  • First 10 words in the text: ['o', 'for', 'a', 'muse', 'of', 'fire', 'that', 'would', 'ascend', 'the'] (this matches expectations).
  • Number of unique words in the vocabulary: 6457.

However, the expected vocabulary count is 6116.

What I’ve tried:

  • Double-checked the regex: r'[^\w\s]' to ensure it removes all punctuation correctly.
  • Ensured all words are converted to lowercase before splitting.

Questions:

  1. Could the discrepancy be due to leftover artifacts in the text file like extra spaces or special characters?
  2. Is there a better way to normalize the text data to align the word count with expectations?

Any advice or insights would be greatly appreciated!

Thanks in advance for your help!

Hi @chenkhoon,

Can you share a screenshot of your output alongside the expected output for the exercise you are working on? Please make sure not to post any graded cell code, as that is against the community guidelines.

Based on the version of the assignment available to me, here is what might have gone slightly wrong.

Check whether you used a global variable to read file_name and then kept using that same variable when converting all the words to lowercase.

Another issue other learners ran into was hard-coding the step that converts the words to lowercase.

They were re-reading file_name and then iterating over it to do the lowercase conversion.

Another common mistake is choosing the wrong pattern. I hope you used re.findall(pattern, string) with the pattern r'\w+'.
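To show why the pattern choice matters, here is a small sketch on a made-up sentence (the sample string and variable names are my own, not from the assignment file or the graded cell). It compares deleting punctuation with r'[^\w\s]' and then splitting against extracting words with r'\w+'. Whether this fully explains the gap between 6457 and 6116 is an assumption on my part, but it is a plausible source of extra vocabulary entries.

```python
import re

# Hypothetical sample line, not taken from shakespeare.txt.
sample = "O, for a Muse of fire! 'Tis well-known; don't."
text = sample.lower()

# Approach A: delete every non-word, non-space character, then split.
# Deleting the hyphen and apostrophe fuses the neighbouring pieces.
stripped_tokens = re.sub(r"[^\w\s]", "", text).split()

# Approach B: pull out every run of word characters.
# Punctuation acts as a separator, so the pieces stay apart.
findall_tokens = re.findall(r"\w+", text)

print(stripped_tokens)
# ['o', 'for', 'a', 'muse', 'of', 'fire', 'tis', 'wellknown', 'dont']
print(findall_tokens)
# ['o', 'for', 'a', 'muse', 'of', 'fire', 'tis', 'well', 'known', 'don', 't']
```

Over a whole file, fused tokens such as 'wellknown' tend to be new vocabulary entries, while pieces such as 'well' and 'known' usually occur elsewhere already, so approach A can end up with a larger set of unique words.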

Look for the "Raw String Notation" section in the Python 're' documentation to understand the difference between r'\W', '\W' and '\\W'.
For the pattern, decide between using '\s', '\w', '\s+' or '\w+'. What do you think are the differences?
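If it helps, the two hints can be explored with a short experiment. This is only a sketch on a made-up string (the sample text is my own); it uses '\n' to illustrate raw strings because that escape makes the difference easy to see.

```python
import re

# Raw string notation: the r prefix keeps the backslash as a literal character.
print(len("\n"))        # 1  (a single newline character)
print(len(r"\n"))       # 2  (a backslash followed by 'n')
print(r"\W" == "\\W")   # True: both spell backslash + 'W'

# Hypothetical sample with two spaces between the words.
sample = "to  be"

print(re.findall(r"\w", sample))   # ['t', 'o', 'b', 'e']  one word character per match
print(re.findall(r"\w+", sample))  # ['to', 'be']          whole runs of word characters
print(re.findall(r"\s", sample))   # [' ', ' ']            one whitespace character per match
print(re.findall(r"\s+", sample))  # ['  ']                whole runs of whitespace
```

The '+' quantifier is what turns single-character matches into whole tokens, and '\w' versus '\s' decides whether you are matching the words themselves or the gaps between them.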

Regards
DP