During the course video on calculating feature vectors for each tweet, it is mentioned that the second feature is “sum of the positive frequencies for every unique word on tweet m”. The example provided in the course video, shown below, does exactly that.
However, in the assignment, calculating the features for only unique words fails some of the tests and does not provide expected results. This is what I expected to work:
But using np.unique to get the unique tokens of a processed tweet before calculating the features fails the tests. If I remove the np.unique() call, it works. This does not follow the course video, though, because duplicated tokens are then taken into account when calculating the features.
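To make the difference concrete, here is a minimal sketch of the two variants, not the assignment's actual solution. The toy frequency dictionary, the token list, and the `unique_only` flag are made up for illustration, and `set` stands in for np.unique so the snippet runs without NumPy.

```python
def extract_features(tokens, freqs, unique_only=False):
    """Return [bias, positive_sum, negative_sum] for one tokenized tweet."""
    if unique_only:
        # Lecture-video variant: each distinct word contributes once.
        tokens = set(tokens)
    # Look up each token's count under the positive (1) and negative (0) label.
    pos = sum(freqs.get((word, 1), 0) for word in tokens)
    neg = sum(freqs.get((word, 0), 0) for word in tokens)
    return [1, pos, neg]


# Hypothetical frequency dictionary and an already-processed tweet.
toy_freqs = {("great", 1): 5, ("great", 0): 1, ("movi", 1): 2, ("movi", 0): 2}
toy_tokens = ["great", "great", "movi"]

with_duplicates = extract_features(toy_tokens, toy_freqs)
unique_only = extract_features(toy_tokens, toy_freqs, unique_only=True)

print(with_duplicates)  # [1, 12, 4]
print(unique_only)      # [1, 7, 3]
```

With duplicates kept, "great" is counted twice, so both sums grow. That is the behaviour the tests appear to expect.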
Yes, I saw how counting duplicates changes the predicted probability depending on how many times the word “great” appears, and it makes sense. It just conflicts with what the course videos present, which consider duplicates only when building the frequency dictionary and not when building the feature vector for each tweet.
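A rough sketch of that effect, under made-up numbers: the weights `theta` and the counts for "great" below are hypothetical, not values from the assignment. It only shows that the sigmoid output moves with the number of repeats when duplicates are counted, and stays flat when they are not.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(tokens, freqs, theta, unique_only=False):
    """Probability of 'positive' from features [bias, pos_sum, neg_sum]."""
    if unique_only:
        tokens = set(tokens)
    pos = sum(freqs.get((word, 1), 0) for word in tokens)
    neg = sum(freqs.get((word, 0), 0) for word in tokens)
    features = [1, pos, neg]
    z = sum(weight * value for weight, value in zip(theta, features))
    return sigmoid(z)


# Hypothetical counts and weights, chosen only for illustration.
toy_freqs = {("great", 1): 5, ("great", 0): 1}
toy_theta = [0.0, 0.1, -0.1]

# Tweets containing "great" one, two, and three times.
probs_with_duplicates = [predict(["great"] * n, toy_freqs, toy_theta) for n in (1, 2, 3)]
probs_unique_only = [
    predict(["great"] * n, toy_freqs, toy_theta, unique_only=True) for n in (1, 2, 3)
]

print(probs_with_duplicates)  # increases with each extra "great"
print(probs_unique_only)      # identical for all three tweets
```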
I think I agree with you
The point should have been stated more explicitly in the assignment: the prediction of being positive or negative depends on a feature vector that counts duplicate words, which is different from the lecture videos.
And also the freqs dictionary was not built using only unique words (at least judging from the code snippet in the notebook):
for y, tweet in zip(ys, tweets):
    for word in process_tweet(tweet):
        pair = (word, y)
        if pair in freqs:
            freqs[pair] += 1
        else:
            freqs[pair] = 1
So my previous point:
is not true.
All in all, I think the freqs dictionary should have been built the way it is now, counting duplicate words.
But the feature extraction should probably have been consistent with the lecture video (your solution), not counting duplicate words.
Anyway, it’s Friday, holidays are coming, and my head is spinning, but I think you are right
You both are correct regarding this. Let me create an issue, either to add a small note to the assignment regarding this, or perhaps change the way extract_features should be implemented, so that it is consistent with the lecture video.