C2_W1 Exercise 1: getting the correct unique word count from process_data

Hi, I worked on this exercise but couldn’t bring my unique word count down from 6205 to the 6116 that later exercises require. I use re.sub to clean the data and then re.split to split it into words. I also split hyphenated words into two words, hoping that would help, but the count is still too high. I’d appreciate any suggestions on how to fix this. Thanks, Josh

Hi @Joshua_Herman

You might have missed the detailed hints:

Detailed hints if you’re stuck

- Use 'with' syntax to read a file
- Decide whether to use 'read()' or 'readline()'. What's the difference?
- You can use str.lower() to convert to lowercase.
- Use re.findall(pattern, string)
- Look for the "Raw String Notation" section in the Python 're' documentation to understand the difference between r'\W', '\W' and '\\W'.
- For the pattern, decide between using '\s', '\w', '\s+' or '\w+'. What do you think are the differences?
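To make the last two hints concrete, here is a small sketch that runs the four candidate patterns over a made-up sample string (the string and variable names are illustrative and not part of the assignment). It also checks what raw string notation changes.

```python
import re

# Made-up sample; note the two spaces after the comma.
sample = "Hi there,  friend!"

single_space = re.findall(r'\s', sample)   # one match per whitespace character
space_runs = re.findall(r'\s+', sample)    # one match per run of whitespace
single_chars = re.findall(r'\w', sample)   # one match per word character
word_runs = re.findall(r'\w+', sample)     # one match per run of word characters

print(single_space)   # [' ', ' ', ' ']
print(space_runs)     # [' ', '  ']
print(single_chars)   # 13 single letters
print(word_runs)      # ['Hi', 'there', 'friend']

# Raw string notation: r'\W' and '\\W' are the same two-character string,
# while r'\n' (backslash + n) differs from '\n' (a single newline character).
print(r'\W' == '\\W')            # True
print(len(r'\n'), len('\n'))     # 2 1
```

Only the patterns with '+' group characters into runs, and only the '\w' family returns the words themselves rather than the gaps between them.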



I get the same word count as Josh, 6205, using these steps:

  1. Convert the document text to lower case.
  2. Replace ‘-\n’ with ‘’ (to join words that continue to the next line).
  3. Replace ‘-’ with ‘ ’ (to represent hyphenated words as two words).
  4. Split the document into a list of words with the default whitespace separator.
  5. Remove non-alpha characters from each word, and remove empty words from the list.
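For reference, the five steps above can be sketched roughly as follows. The function name and sample sentence are made up for illustration, and "non-alpha" is assumed here to mean anything outside a-z.

```python
import re

def process_text_stepwise(text):
    """Illustrative version of the five steps; not the graded solution."""
    text = text.lower()                # 1. lower case
    text = text.replace('-\n', '')     # 2. rejoin words broken across lines
    text = text.replace('-', ' ')      # 3. hyphenated words become two words
    words = text.split()               # 4. split on whitespace
    # 5. strip non-alpha characters (assumed: anything outside a-z),
    #    then drop words that end up empty
    cleaned = [re.sub(r'[^a-z]', '', w) for w in words]
    return [w for w in cleaned if w]

sample = "The well-arm'd knight re-\nturned, 3 times!"
words = process_text_stepwise(sample)
print(words)  # ['the', 'well', 'armd', 'knight', 'returned', 'times']
```

On this sample the steps keep "armd" as a single word and drop the "3" entirely, so the resulting vocabulary depends heavily on how such edge cases are treated.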

I tried splitting contracted words (e.g., “arm’d” into “arm” and “d”), but that reduced the word count to 5961 (below the expected 6116).

Can we get more direction on how to process edge cases (e.g., how do we handle non-alpha characters in “words” after splitting the document text)?


Hi @John_Rule

You are overthinking this :slight_smile: After lowercasing the text, the only thing you need is re.findall(..) with the correct arguments, and that’s it.
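As a rough sketch of that shape (not the official solution): read the whole file, lowercase it, and make a single re.findall call. The pattern r'\w+' is an assumption picked from the hint list, and the temporary file is a made-up stand-in for the course text. Whether this reproduces 6116 on the real file is not confirmed here.

```python
import os
import re
import tempfile

def process_data(file_name):
    """Read a text file, lowercase it, and return its words as a list."""
    with open(file_name, encoding='utf-8') as f:
        text = f.read()  # read() returns the whole file as one string
    # r'\w+' is an assumed pattern choice, taken from the hint list
    return re.findall(r'\w+', text.lower())

# Hypothetical stand-in for the course's text file
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write("The knight re-\nturned. The KNIGHT was arm'd!")
    sample_path = tmp.name

word_l = process_data(sample_path)
vocab = set(word_l)
os.remove(sample_path)

print(word_l)
print(len(word_l), len(vocab))  # 9 7
```

With this pattern, hyphens, apostrophes and line breaks all act as separators, so "arm'd" becomes "arm" and "d" without any extra replace or split steps.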