C2_W3 Expectation of tokenize_sentences() function changes as assignment progresses

While writing the tokenize_sentences() function, the unit tests failed with this message:

Wrong output values.
Expected: [['really', 'really', 'long', 'sentence', '.', 'it', 'is', 'very', 'long', 'indeed', ';', 'so', 'long', '…']]
Got: [['really', 'really', 'long', 'sentence.', 'it', 'is', 'very', 'long', 'indeed;', 'so', 'long…']]

This surprised me, since I didn’t see how to split the sentence while keeping repeated punctuation as its own token without using the Python regex library. So I added a regex and the tests passed.

A few lines later in the assignment, I see this as the expected output on the test set:

['that', 'picture', 'i', 'just', 'seen', 'whoa', 'dere', '!', '!', '>', '>', '>', '>', '>', '>', '>']

This seems to contradict the unit test expectation. Luckily I was able to proceed anyway, but I am wondering what the expectation is for how tokenize_sentences() is supposed to split repeated punctuation. Should it keep the exclamation marks together, like '!!', or should it split them into individual characters, like '!', '!'?
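For concreteness, the two patterns below are only a minimal illustration of the distinction I mean (not my actual submission): the first keeps runs of punctuation together, the second splits every punctuation character.

    import re

    sentence = "that picture i just seen whoa dere!!"

    # Runs of punctuation stay together, e.g. ..., 'dere', '!!'
    print(re.findall(r"\w+|[^\w\s]+", sentence))

    # Each punctuation character becomes its own token, e.g. ..., 'dere', '!', '!'
    print(re.findall(r"\w+|[^\w\s]", sentence))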

Hi @jfeller35,

Are you saying that you went ahead with the rest of the exercise even though the previous unit-test cell for tokenize_sentences was not giving you the expected values?

Hello @jfeller35

The unit tests fail because your custom regex tokenizer leaves punctuation glued to the preceding word, whereas nltk.word_tokenize splits every punctuation mark into its own token.

In the expected output ['really', 'really', 'long', 'sentence', '.', 'it', 'is', 'very', 'long', 'indeed', ';', 'so', 'long', '…'], the period, the semicolon, and the ellipsis each appear as independent tokens. Your result ['really', 'really', 'long', 'sentence.', 'it', 'is', 'very', 'long', 'indeed;', 'so', 'long…'] still contains 'sentence.', 'indeed;', and 'long…', which shows that the regex did not detach the punctuation.

nltk.word_tokenize is built on the Treebank rules: it automatically separates every punctuation character (including duplicates such as '!!' or '>>>'), keeps three consecutive dots together as a single ellipsis token, and handles many corner cases that a simple pattern like \w+|\S (I'm assuming this is your regular expression) misses. Replacing your regex with word_tokenize(sentence.lower()) therefore yields the expected list and passes the tests without any additional manual rules.
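If you want to verify this directly, a quick check along these lines shows the behaviour (the example strings are just illustrative):

    import nltk
    nltk.download("punkt", quiet=True)  # tokenizer models, skip if already installed
    from nltk.tokenize import word_tokenize

    # Repeated '!' marks are split into separate tokens: ..., 'dere', '!', '!'
    print(word_tokenize("that picture i just seen whoa dere!!"))

    # ';' becomes its own token and '...' is kept together as one ellipsis token
    print(word_tokenize("it is very long indeed; so long..."))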


Hello @jfeller35,

It works well if you follow the hints right above the cell for tokenize_sentences, which suggest the required approach:

  1. Make all strings lowercase with str.lower.
  2. Use nltk.word_tokenize to tokenize (a short sketch follows below).
  3. There is also a caution: if you use str.split instead of nltk.word_tokenize, there are additional edge cases to handle, such as punctuation (commas, periods) that follows a word.
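Here is a minimal sketch of that approach, assuming the function takes a list of sentence strings and returns a list of token lists (as in the expected outputs above); treat it as an illustration rather than the official solution:

    from nltk.tokenize import word_tokenize

    def tokenize_sentences(sentences):
        """Lowercase each sentence and split it into tokens with nltk.word_tokenize."""
        tokenized_sentences = []
        for sentence in sentences:
            sentence = sentence.lower()          # hint 1: make the string lowercase
            tokens = word_tokenize(sentence)     # hint 2: let nltk handle the punctuation
            tokenized_sentences.append(tokens)
        return tokenized_sentences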

Thanks, luizhcz. I didn’t realize it treats the three periods as a single ellipsis token instead of three separate punctuation marks.