While doing the “Week 1” assignment, I was surprised that my results didn’t match the expected output after parsing the “./data/bbc-text.csv” file: I got fewer words after calling the “remove_stopwords” function.
After doing a few tests I discovered that this assignment doesn’t expect the student to take punctuation into account while removing stopwords from the sentences. So when the stop list contains the word “them”, and we have two sentences:
- … for them in terms …
- … they play on them.
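According to the assignment’s expected output, “them” should be removed only from the first one, presumably because the token in the second sentence is “them.” (with the trailing period), which doesn’t match the stop list entry exactly.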
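To make the difference concrete, here is a minimal sketch of the two behaviours. It is not the assignment’s reference solution: the function names, the one-word stop list, and the shortened sentences are my own illustration, and I’m assuming the expected behaviour comes from splitting on whitespace and comparing each token as-is.

```python
import string

def remove_stopwords_naive(sentence, stopwords):
    """Drop tokens that exactly match a stop word after a whitespace split."""
    words = sentence.lower().split()
    return " ".join(w for w in words if w not in stopwords)

def remove_stopwords_punct_aware(sentence, stopwords):
    """Same as above, but compare each token with surrounding ASCII punctuation stripped."""
    words = sentence.lower().split()
    return " ".join(w for w in words if w.strip(string.punctuation) not in stopwords)

# Hypothetical one-word stop list, just for this illustration
stopwords = {"them"}

print(remove_stopwords_naive("for them in terms", stopwords))         # for in terms
print(remove_stopwords_naive("they play on them.", stopwords))        # they play on them.
print(remove_stopwords_punct_aware("they play on them.", stopwords))  # they play on
```

The naive version leaves “them.” in place because the token with the period is not in the stop list, which seems to be what the expected output relies on.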
What is the reason for this? If I process the test data from this assignment and take punctuation into account when removing stopwords, the final vocabulary shrinks by 42 words:
== My result: Vocabulary contains 29672 words
== Expected: Vocabulary contains 29714 words
Here are some examples of tokens from the test data that would be removed if punctuation were taken into account: “them.”, “it.”, “for!”, “(you”, “up)”, “-which”, “[when”, “than…”, “it:”, etc.
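For reference, this is roughly how I reduced those tokens to the underlying word before checking them against the stop list. It is only a sketch: `core_word` and `STRIP_CHARS` are names I made up, and I added the single-character ellipsis “…” by hand because Python’s `string.punctuation` covers ASCII punctuation only.

```python
import string

# ASCII punctuation plus the Unicode ellipsis, which string.punctuation does not include
STRIP_CHARS = string.punctuation + "…"

def core_word(token):
    """Lowercase a token and strip punctuation from both ends (hypothetical helper)."""
    return token.lower().strip(STRIP_CHARS)

examples = ["them.", "it.", "for!", "(you", "up)", "-which", "[when", "than…", "it:"]
print([core_word(t) for t in examples])
# ['them', 'it', 'for', 'you', 'up', 'which', 'when', 'than', 'it']
```

Each of these reduces to a common English function word, so whether a token like “them.” is removed depends entirely on whether the comparison is done before or after stripping.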