Stuck at process_data

I get 6457 words and fail one of the unit tests. Could someone please help with the regular expression to get the correct number of words? I have tried using string.punctuation instead, but it doesn't give the correct number.
Thank you

Similar problem (just with 6303 words). It is super annoying not to be able to diff against the desired outcome to see what the exact specification is.

The answer turned out to be: they consider a "word" to be any string consisting only of what Python's re \w treats as word characters. Therefore literary contractions (if you look at the text, things like "deceiv'd" or "suggest'd") are counted as two words ('suggest' and 'd').
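A quick way to see this behavior for yourself (a small sketch using Python's `re`; the sample string is just an illustration, not taken from the grader):

```python
import re

# \w matches [a-zA-Z0-9_], so the apostrophe in a contraction
# is not a word character and splits the token in two.
tokens = re.findall(r"\w+", "deceiv'd")
print(tokens)  # ['deceiv', 'd'] — the contraction counts as two words
```

This is why any tokenizer that keeps apostrophes inside words will land on a different total than the expected one.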

If you don't use the implementation path they foresee, as hinted at in the hints (never explicitly stated), you are probably doomed to run into unnecessary problems.

Thank you! I definitely agree that it would help to be able to see the difference.

Thanks Ketzu! Indeed, if there is a specific definition of what counts as a word, the requirements should spell it out explicitly. I would never have guessed that.

Assume numbers (e.g. page numbers) are 'words' as well. So far, process_data() is the worst-defined exercise :-/ I did it in 5 minutes and then spent 1.5 hours trying to guess the correct pattern…

The unit test doesn't help either: it checks only the first/last 10 words. I've tried many different combinations of splitting words on "'", "-", numbers, etc. to find the magic 6116 unique words :wink:
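To illustrate why guessing the pattern is so frustrating, here is a comparison of a few plausible regexes on a made-up snippet (not the assignment text): each one produces a different unique-word count.

```python
import re

text = "Page 42: O, that this too too solid flesh would melt, deceiv'd again."

patterns = {
    r"\w+": "word chars only (splits contractions, keeps numbers)",
    r"[a-z']+": "letters and apostrophes (keeps contractions together)",
    r"[a-z]+": "letters only (drops numbers)",
}

for pattern, description in patterns.items():
    unique = set(re.findall(pattern, text.lower()))
    print(f"{pattern!r}: {len(unique)} unique tokens ({description})")
```

On this snippet the three patterns yield 13, 11, and 12 unique tokens respectively, so even small differences in the pattern move the final count.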

Most hateful assignment. Why should I be an expert in re? And I still can't guess what to use. This is part of my code, and I still get 6030 unique words.

import re

words = []
with open(file_name, encoding="utf-8") as f:
    for line in f.readlines():
        # line = re.sub(r'[^\w\s]', '', line)
        for word in line.split():
            word = word.lower()
            word = re.findall(r'\w+', word)  # may return several tokens per word
            if word:
                if word[0] not in words:  # only the first token is kept
                    print(word[0])
                    words.append(word[0])

Can anybody help? I'm about to go off the rails.

At least this craziness ended. You should use something like this:

with open(file_name, encoding="utf-8") as f:
    for word in re.findall(r'\w+', f.read()):
        ...
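Putting the pieces from this thread together, here is a minimal sketch matching the \w-based word definition (the function and variable names are placeholders, not the grader's exact code):

```python
import re

def process_data(file_name):
    """Return the unique lowercase \\w+ tokens, in order of first appearance."""
    with open(file_name, encoding="utf-8") as f:
        text = f.read().lower()  # lowercase once, up front
    words = []
    seen = set()
    for word in re.findall(r"\w+", text):
        if word not in seen:  # set membership check keeps this fast
            seen.add(word)
            words.append(word)
    return words
```

Reading the whole file with `f.read()` and tokenizing it in one `re.findall` call avoids the per-line, per-word splitting that loses tokens after an apostrophe.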


I also had 6457 when I tried to read the file, process the data, and then append to the list (at which point I came to this page, but didn’t find it too helpful).

When I took a step back and followed the directions in the block and the detailed hints more directly, I had it working as they expect almost immediately. The only change I made was converting to lower case only once.

You can find additional hints on how to implement it in the lab "building_the_vocabulary_model", including information on how to read the data and use re.findall correctly. Highly useful if you're stuck.