C2_W1 Exercise 1 process_data to get correct unique word count

Hi, I worked on this exercise and couldn’t reduce my unique word count from 6205 down to the 6116 that is required for future exercises. I used re.sub to process the data before splitting it into words with re.split. I also split hyphenated words into two words, hoping that would help, but the count is still too high. I’d welcome suggestions on how to fix this. Thanks, Josh

Hi @Joshua_Herman

You might have missed the detailed hints:

Detailed hints if you’re stuck

- Use 'with' syntax to read a file
- Decide whether to use 'read()' or 'readline()'. What's the difference?
- You can use str.lower() to convert to lowercase.
- Use re.findall(pattern, string)
- Look for the "Raw String Notation" section in the Python 're' documentation to understand the difference between r'\W' and '\W'.
- For the pattern, decide between using '\s', '\w', '\s+' or '\w+'. What do you think are the differences?
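Putting those hints together, a minimal sketch might look like the following (the corpus file name is a placeholder, and the sample sentence is just for illustration):

```python
import re

def process_data(text):
    # Lowercase, then grab runs of word characters with \w+
    # (\w matches letters, digits, and underscore)
    return re.findall(r"\w+", text.lower())

# Reading the corpus would look something like this
# (file name is a placeholder):
# with open("shakespeare.txt") as f:
#     words = process_data(f.read())

sample = "O Romeo, Romeo! wherefore art thou Romeo?"
words = process_data(sample)
print(words)            # ['o', 'romeo', 'romeo', 'wherefore', 'art', 'thou', 'romeo']
print(len(set(words)))  # 5 unique words
```

Note that `re.findall` does the splitting and the punctuation removal in one step, so no separate `re.split`/`re.sub` passes are needed.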



I get the same word count as Josh, 6205, using these steps:

  1. Convert the document text to lower case.
  2. Replace ‘-\n’ with ‘’ (to join words that continue to the next line).
  3. Replace ‘-’ with ‘ ’ (to represent hyphenated words as two words).
  4. Split the document into a list of words with the default whitespace separator.
  5. Remove non-alpha characters from each word, and remove empty words from the list.
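For comparison, the five steps above can be sketched roughly as follows (the input sentence is purely illustrative):

```python
import re

def pipeline(text):
    # Steps 1-3: lowercase, rejoin line-break hyphenation, split hyphenated words
    text = text.lower().replace("-\n", "").replace("-", " ")
    # Step 4: split on the default whitespace separator
    # Step 5: strip non-alpha characters, then drop empty strings
    words = [re.sub(r"[^a-z]", "", w) for w in text.split()]
    return [w for w in words if w]

print(pipeline("Self-\nmade men are well-known; arm'd 2 times."))
# ['selfmade', 'men', 'are', 'well', 'known', 'armd', 'times']
```

One visible source of divergence from the `re.findall(r"\w+", ...)` approach: this pipeline turns “arm’d” into “armd” and drops “2” entirely, while `\w+` would yield “arm”, “d”, and “2” as three separate tokens, so the two methods count different sets of unique words.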

I tried splitting contracted words (e.g., “arm’d” into “arm” and “d”), but that reduced the word count to 5961 (below the expected 6116).

Can we get more direction on how to process edge cases (e.g., how do we handle non-alpha characters in “words” after splitting the document text)?


Hi @John_Rule

You are overthinking :slight_smile: - after lowercasing the text, the only thing you need is re.findall(..) with the correct arguments, and that’s it.


I’d ask’d to be inform’d: why are numbers words, while words with punctuation are not? Ne’er I’d’ve guessed that they were looking for \w rather than [a-z]+ or [a-z']+ :slight_smile:
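The difference Aaron is pointing at can be seen directly on a small example (the sentence is made up for illustration):

```python
import re

line = "i'd ask'd to be inform'd in 1609."

# \w+ keeps digits but splits on apostrophes
print(re.findall(r"\w+", line))
# ['i', 'd', 'ask', 'd', 'to', 'be', 'inform', 'd', 'in', '1609']

# [a-z]+ drops digits and also splits on apostrophes
print(re.findall(r"[a-z]+", line))
# ['i', 'd', 'ask', 'd', 'to', 'be', 'inform', 'd', 'in']

# [a-z']+ keeps contractions intact but drops digits
print(re.findall(r"[a-z']+", line))
# ["i'd", "ask'd", 'to', 'be', "inform'd", 'in']
```

Each pattern implies a different notion of “word”, hence a different unique-word count on the same corpus.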

Hi @Aaron_Newman

The answer is always decided case by case. If your task or application benefits from treating punctuated “words” as words, then you should go with that option.

As for the assignment, the instructions give you options to choose from:

  • For the pattern, decide between using ‘\s’, ‘\w’, ‘\s+’ or ‘\w+’. What do you think are the differences?

In other words, the course creators decided to go with this option and provided you with a hint.

But when you have your own application, you should decide for yourself what is and is not a “word”. Ultimately, what matters is whether you are achieving your goals, regardless of what is or is not considered a “word”.

In any case, most modern tokenizers use sub-word tokens to represent text, so this discussion is largely irrelevant for recent models.