When writing the tokenize_sentences() function, the unit tests failed with wrong output values:
Expected: [['really', 'really', 'long', 'sentence', '.', 'it', 'is', 'very', 'long', 'indeed', ';', 'so', 'long', '…']]
Got: [['really', 'really', 'long', 'sentence.', 'it', 'is', 'very', 'long', 'indeed;', 'so', 'long…']]
I was surprised, since I don't see how you can split the sentence while keeping repeated punctuation as its own token without using Python's re (regex) library. So I added that, and the tests passed.
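For reference, here is a minimal sketch of the regex approach I used. The function name comes from the assignment, but the pattern and the lowercasing step are my own choices, not something the assignment specifies:

    import re

    def tokenize_sentences(sentences):
        """Sketch: split each sentence into word tokens and punctuation tokens.
        With this pattern a run of punctuation stays together as one token."""
        tokenized = []
        for sentence in sentences:
            sentence = sentence.lower()
            # \w+ matches a run of word characters; [^\w\s]+ matches a run of
            # characters that are neither word characters nor whitespace.
            tokens = re.findall(r"\w+|[^\w\s]+", sentence)
            tokenized.append(tokens)
        return tokenized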
A few lines later in the assignment, I see this for the expected output on the test set:
['that', 'picture', 'i', 'just', 'seen', 'whoa', 'dere', '!', '!', '>', '>', '>', '>', '>', '>', '>']
This seems to contradict the unit test expectation. Luckily I was able to proceed, but I am wondering what the expected behavior of tokenize_sentences() is when splitting on repeated punctuation. Should it keep the exclamations together as one token, like '!!', or should it split on each character, like '!', '!'?
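To make the question concrete, here are the two behaviors I can imagine; both patterns are just my guesses at how it might be implemented, not the official solution:

    import re

    text = "so long!!"
    # Option 1: keep a run of punctuation together -> ['so', 'long', '!!']
    print(re.findall(r"\w+|[^\w\s]+", text))
    # Option 2: split each punctuation character -> ['so', 'long', '!', '!']
    print(re.findall(r"\w+|[^\w\s]", text))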