C2_W3 Expectation of tokenize_sentences() function changes as assignment progresses

While writing the tokenize_sentences() function, the unit tests failed with this message:

Wrong output values.
Expected: [['really', 'really', 'long', 'sentence', '.', 'it', 'is', 'very', 'long', 'indeed', ';', 'so', 'long', '…']]
Got: [['really', 'really', 'long', 'sentence.', 'it', 'is', 'very', 'long', 'indeed;', 'so', 'long…']]

This surprised me, since I didn’t see how to split the sentence while keeping repeated punctuation as its own token without using the Python regex library. So I added a regex and the tests passed.

A few lines later in the assignment, I see this as the expected output on the test set:

['that', 'picture', 'i', 'just', 'seen', 'whoa', 'dere', '!', '!', '>', '>', '>', '>', '>', '>', '>']

This seems to contradict the unit test expectation. Luckily I was able to proceed anyway, but I am wondering what the expectation is for how tokenize_sentences() is supposed to split repeated punctuation. Should it keep the exclamation marks together, like '!!', or should it split them into individual characters, like '!', '!'?
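For concreteness, the two patterns below are only a minimal illustration of the distinction I mean (not my actual submission): the first keeps runs of punctuation together, the second splits every punctuation character.

    import re

    sentence = "that picture i just seen whoa dere!!"

    # Runs of punctuation stay together, e.g. ..., 'dere', '!!'
    print(re.findall(r"\w+|[^\w\s]+", sentence))

    # Each punctuation character becomes its own token, e.g. ..., 'dere', '!', '!'
    print(re.findall(r"\w+|[^\w\s]", sentence))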

Hi @jfeller35,

Are you saying that you went ahead with the rest of the exercise even though the previous unit-test cell for tokenize_sentences was not giving you the expected values?

Hello @jfeller35

The unit tests fail because your custom regex tokenizer leaves punctuation glued to the preceding word, whereas nltk.word_tokenize splits every punctuation mark into its own token.

In the expected output ['really', 'really', 'long', 'sentence', '.', 'it', 'is', 'very', 'long', 'indeed', ';', 'so', 'long', '…'], the period, the semicolon, and the ellipsis each appear as independent tokens. Your result ['really', 'really', 'long', 'sentence.', 'it', 'is', 'very', 'long', 'indeed;', 'so', 'long…'] still contains 'sentence.', 'indeed;', and 'long…', which shows that the regex did not detach the punctuation.

nltk.word_tokenize is built on the Treebank rules: it automatically separates every punctuation character (including duplicates such as '!!' or '>>>'), keeps three consecutive dots together as a single ellipsis token, and handles many corner cases that a simple pattern like \w+|\S (I'm assuming this is your regular expression) misses. Replacing your regex with word_tokenize(sentence.lower()) therefore yields the expected list and passes the tests without any additional manual rules.
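If you want to verify this directly, a quick check along these lines shows the behaviour (the example strings are just illustrative):

    import nltk
    nltk.download("punkt", quiet=True)  # tokenizer models, skip if already installed
    from nltk.tokenize import word_tokenize

    # Repeated '!' marks are split into separate tokens: ..., 'dere', '!', '!'
    print(word_tokenize("that picture i just seen whoa dere!!"))

    # ';' becomes its own token and '...' is kept together as one ellipsis token
    print(word_tokenize("it is very long indeed; so long..."))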


Hello @jfeller35,

It works well if you follow the hints right above the cell for tokenize_sentences, which suggest the required approach:

  1. Make all strings lowercase with str.lower.
  2. Use nltk.word_tokenize to tokenize (a short sketch follows below).
  3. There is also a caution: if you use str.split instead of nltk.word_tokenize, there are additional edge cases to handle, such as punctuation (commas, periods) that follows a word.
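Here is a minimal sketch of that approach, assuming the function takes a list of sentence strings and returns a list of token lists (as in the expected outputs above); treat it as an illustration rather than the official solution:

    from nltk.tokenize import word_tokenize

    def tokenize_sentences(sentences):
        """Lowercase each sentence and split it into tokens with nltk.word_tokenize."""
        tokenized_sentences = []
        for sentence in sentences:
            sentence = sentence.lower()          # hint 1: make the string lowercase
            tokens = word_tokenize(sentence)     # hint 2: let nltk handle the punctuation
            tokenized_sentences.append(tokens)
        return tokenized_sentences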

Thanks, luizhcz. I didn’t realize it treats the three periods as a single ellipsis token instead of three separate punctuation marks.