C2_W1 Exercise 1 process_data to get correct unique word count

Hi, I worked on this exercise and couldn’t reduce my unique word count from 6205 down to the 6116 that is required for future exercises. I used re.sub to process the data before splitting it into words with re.split. I also split hyphenated words into two words, hoping that would help, but the count is still too high. I’d welcome suggestions on how to fix this. Thanks, Josh

Hi @Joshua_Herman

You might have missed the detailed hints:

Detailed hints if you’re stuck

- Use 'with' syntax to read a file
- Decide whether to use 'read()' or 'readline()'. What's the difference?
- You can use str.lower() to convert to lowercase.
- Use re.findall(pattern, string)
- Look for the "Raw String Notation" section in the Python 're' documentation to understand the difference between r'\W' and '\W'.
- For the pattern, decide between using '\s', '\w', '\s+' or '\w+'. What do you think are the differences?
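Putting those hints together, a minimal sketch might look like the following (the corpus file name is a placeholder, and the sample sentence is just for illustration):

```python
import re

def process_data(text):
    # Lowercase, then grab runs of word characters with \w+
    # (\w matches letters, digits, and underscore)
    return re.findall(r"\w+", text.lower())

# Reading the corpus would look something like this
# (file name is a placeholder):
# with open("shakespeare.txt") as f:
#     words = process_data(f.read())

sample = "O Romeo, Romeo! wherefore art thou Romeo?"
words = process_data(sample)
print(words)            # ['o', 'romeo', 'romeo', 'wherefore', 'art', 'thou', 'romeo']
print(len(set(words)))  # 5 unique words
```

Note that `re.findall` does the splitting and the punctuation removal in one step, so no separate `re.split`/`re.sub` passes are needed.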



I get the same word count as Josh, 6205, using these steps:

  1. Convert the document text to lower case.
  2. Replace ‘-\n’ with ‘’ (to join words that continue to the next line).
  3. Replace ‘-’ with ‘ ’ (to represent hyphenated words as two words).
  4. Split the document into a list of words with the default whitespace separator.
  5. Remove non-alpha characters from each word, and remove empty words from the list.
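For comparison, the five steps above can be sketched roughly as follows (the input sentence is purely illustrative):

```python
import re

def pipeline(text):
    # Steps 1-3: lowercase, rejoin line-break hyphenation, split hyphenated words
    text = text.lower().replace("-\n", "").replace("-", " ")
    # Step 4: split on the default whitespace separator
    # Step 5: strip non-alpha characters, then drop empty strings
    words = [re.sub(r"[^a-z]", "", w) for w in text.split()]
    return [w for w in words if w]

print(pipeline("Self-\nmade men are well-known; arm'd 2 times."))
# ['selfmade', 'men', 'are', 'well', 'known', 'armd', 'times']
```

One visible source of divergence from the `re.findall(r"\w+", ...)` approach: this pipeline turns “arm’d” into “armd” and drops “2” entirely, while `\w+` would yield “arm”, “d”, and “2” as three separate tokens, so the two methods count different sets of unique words.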

I tried splitting contracted words (e.g., “arm’d” into “arm” and “d”), but that reduced the word count to 5961 (below the expected 6116).

Can we get more direction on how to process edge cases (e.g., how do we handle non-alpha characters in “words” after splitting the document text)?


Hi @John_Rule

You are overthinking :slight_smile: - after lowercasing the text, the only thing you need is re.findall(..) with the correct arguments, and that’s it.


I’d ask’d to be inform’d: why are numbers words, while words with punctuation are not? Ne’er I’d’ve guessed that they were looking for \w rather than [a-z]+ or [a-z']+ :slight_smile:
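The difference Aaron is pointing at can be seen directly on a small example (the sentence is made up for illustration):

```python
import re

line = "i'd ask'd to be inform'd in 1609."

# \w+ keeps digits but splits on apostrophes
print(re.findall(r"\w+", line))
# ['i', 'd', 'ask', 'd', 'to', 'be', 'inform', 'd', 'in', '1609']

# [a-z]+ drops digits and also splits on apostrophes
print(re.findall(r"[a-z]+", line))
# ['i', 'd', 'ask', 'd', 'to', 'be', 'inform', 'd', 'in']

# [a-z']+ keeps contractions intact but drops digits
print(re.findall(r"[a-z']+", line))
# ["i'd", "ask'd", 'to', 'be', "inform'd", 'in']
```

Each pattern implies a different notion of “word”, hence a different unique-word count on the same corpus.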

Hi @Aaron_Newman

The answer is always decided case by case. If your task or application benefits from treating punctuated “words” as words, then you should go with that option.

As for the assignment, the instructions give you options to choose from:

  • For the pattern, decide between using ‘\s’, ‘\w’, ‘\s+’ or ‘\w+’. What do you think are the differences?

In other words, the course creators decided to go with this option and provided you with a hint.

But when you have your own application, you should decide for yourself what is and is not a “word”. Ultimately, what matters is whether you are achieving your goals, regardless of what is or is not considered a “word”.

In any case, most modern tokenizers use sub-word tokens to represent text, so this discussion is largely irrelevant for recent models.