Issues with remove_stopwords() in weekly assignment

I kept getting 439 words for the first article rather than the given answer of 436 (this article is also used in grading). I looked through the related posts here and verified that 436 is obtained by the following code:
words = [w for w in sentence.split() if w not in stopwords]
sentence = ' '.join(words)

However, this code assumes that spaces are the only separators between words, which is not the case. The corpus contains many periods, parentheses, hyphens, numbers and other symbols. With this code, “them.” is treated as different from “them”: the latter is a stopword and is removed, while the former is not a stopword and is kept. Moreover, strings like “26-inch”, “-”, “sky+” and “(liquid” are all counted as words.
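To make the problem concrete, here is a minimal sketch of the whitespace-only approach. The stopword set and the sentence are made up for illustration (they are not the assignment's actual list or corpus), and the function name is hypothetical:

```python
# Hypothetical stopword subset and sentence, chosen only for illustration.
stopwords = {"them", "the", "a", "of"}
sentence = "he sold a 26-inch tv to them."


def remove_stopwords_whitespace(sentence, stopwords):
    """Drop stopwords after splitting on whitespace only."""
    words = [w for w in sentence.split() if w not in stopwords]
    return " ".join(words)


result = remove_stopwords_whitespace(sentence, stopwords)
print(result)  # he sold 26-inch tv to them.
```

The trailing “them.” survives because the token includes the period and so does not match the stopword “them”.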

Assuming a word can only contain lowercase letters and apostrophes (some stopwords, such as “you’re”, contain an apostrophe), one way to properly identify words and remove stopwords is to use a regular expression (otherwise we would need to identify words by looping over characters manually):
import re
words = [w for w in re.split(r"[^a-z']+", sentence) if (len(w) > 0) and (w not in stopwords)]
sentence = ' '.join(words)
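As a rough comparison of the two approaches, the sketch below runs both on the same input. The stopword set, the sentence and the function names are assumptions made up for this example, not taken from the assignment:

```python
import re

# Hypothetical stopword subset and sentence, chosen only for illustration.
stopwords = {"them", "you're", "a", "to"}
sentence = "you're selling a 26-inch tv (new) to them."


def remove_stopwords_whitespace(sentence, stopwords):
    """Split on whitespace only, then drop stopwords."""
    return " ".join(w for w in sentence.split() if w not in stopwords)


def remove_stopwords_regex(sentence, stopwords):
    """Split on runs of anything that is not a-z or an apostrophe."""
    tokens = re.split(r"[^a-z']+", sentence)
    # re.split can yield empty strings at the edges, so filter them out.
    return " ".join(w for w in tokens if w and w not in stopwords)


whitespace_result = remove_stopwords_whitespace(sentence, stopwords)
regex_result = remove_stopwords_regex(sentence, stopwords)

print(whitespace_result)  # selling 26-inch tv (new) them.
print(regex_result)       # selling inch tv new
print(len(whitespace_result.split()), len(regex_result.split()))  # 5 4
```

The word counts differ (5 versus 4) even on one short sentence. Note that the regular expression also discards digits, so “26-inch” becomes “inch”; whether that is desirable depends on how a word is defined, which is exactly the ambiguity in question.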

I understand that string processing is not the point here, but if we are going to be graded on how many words are left after stopwords are removed, the problem statement needs to define unambiguously what constitutes a word. The answer cannot depend on the specific way the grader has in mind to “approximately” implement the logic.

If I had not found the posts here revealing which implementation the answer is based on, there would have been no way for me to work out why the answer is 436 and mine is 439. Even with this information, it took me two hours to pin down exactly what the discrepancy consists of. Life would have been much easier if the task statement had been made rigorous in the first place.

Splitting by whitespace is sufficient for this assignment. Don’t worry about additional string processing. Did you lowercase the sentence before stopword removal?

Thank you for your reply. I already figured out how to get the count of 436 in the answer. I am just saying that it would be great if the notebook explicitly told us: “It is sufficient to split by whitespace here; don’t worry about punctuation symbols.”

Agreed & thanks for bringing it up. I’ve notified the staff to look into this.

This has been updated. Sorry for the delay, and thanks for flagging.