Assignment inconsistent with course video: Frequencies for unique words or not?

Hello,

In the course video on calculating feature vectors for each tweet, it is mentioned that the second feature is the "sum of the positive frequencies for every unique word on tweet m", and the example provided in the video, shown below, does exactly that.
[screenshot from the course video]

However, in the assignment, calculating the features over only the unique words fails some of the tests and does not produce the expected results. This is what I expected to work:

[screenshot of my expected implementation]

Using np.unique() to get the unique tokens of a processed tweet before calculating the features does not work and fails the tests, whereas removing np.unique() makes the tests pass. That does not follow the course video, though, because duplicated tokens are then taken into account when calculating the features.
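
For reference, here is a minimal sketch of the two variants I mean. The names extract_features, process_tweet and freqs follow the assignment, and I am assuming freqs is keyed by (word, label) with labels 1.0 and 0.0; the bodies are my own illustration, not the graded solution.

    import numpy as np

    def extract_features(tweet, freqs):
        """Variant that matches the assignment tests: every token is counted,
        so duplicated words add their frequencies repeatedly."""
        x = np.zeros(3)
        x[0] = 1  # bias term
        for word in process_tweet(tweet):          # duplicates kept
            x[1] += freqs.get((word, 1.0), 0)      # positive frequency
            x[2] += freqs.get((word, 0.0), 0)      # negative frequency
        return x

    def extract_features_unique(tweet, freqs):
        """Variant that follows the lecture video: each unique word of the
        tweet contributes its frequencies exactly once."""
        x = np.zeros(3)
        x[0] = 1  # bias term
        for word in np.unique(process_tweet(tweet)):   # duplicates removed
            x[1] += freqs.get((word, 1.0), 0)
            x[2] += freqs.get((word, 0.0), 0)
        return x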

Thank you for your help!

Hi @Yoones_Vaezi1

Very good observation :+1:

The assignment is consistent in weighing duplicate words:

great -> 0.516065
great great -> 0.532096
great great great -> 0.548062
great great great great -> 0.563929
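
Just to illustrate why this happens (the numbers below are made up, not the trained weights or frequencies from the assignment): each extra "great" adds the same positive and negative frequencies to the feature vector, so theta·x, and therefore the sigmoid output, keeps growing.

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    # made-up frequencies and weights, only to show the trend
    pos_freq_great, neg_freq_great = 120, 100
    theta = np.array([0.0, 6e-4, -6e-4])   # [bias, positive, negative] weights

    for n in range(1, 5):                   # "great" repeated n times
        x = np.array([1, n * pos_freq_great, n * neg_freq_great])
        print(n, sigmoid(theta @ x))        # the probability grows with n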

The lecture video talks about feature extraction - building the freqs dictionary (which should not have counted duplicate words).

Yes, I saw how counting in duplicates changes the probability for different numbers of the word "great", and it makes sense. It is just in conflict with what the course videos present, which is to consider duplicates only while building the frequency dictionary and not while building the feature vector for each tweet.

I think I agree with you :slight_smile:
The point should have been stated more explicitly in the assignment - that the positive/negative prediction depends on a feature vector which counts in duplicate words - which is different from the lecture videos.

Also, the freqs dictionary was not built using only unique words (at least judging from the code snippet in the notebook):

    # build the freqs dictionary: (word, label) -> count
    freqs = {}
    for y, tweet in zip(ys, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            if pair in freqs:
                freqs[pair] += 1   # a word repeated inside the same tweet increments its count again
            else:
                freqs[pair] = 1
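
A quick sanity check of that snippet (the tweet text is just an example, and I am assuming process_tweet stems "great" to "great") shows that a word repeated inside a single tweet is indeed counted more than once:

    ys = [1.0]
    tweets = ["great great day"]
    # ... run the loop above ...
    print(freqs[("great", 1.0)])   # 2 -> the duplicate inside one tweet was counted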

So my previous point (that the freqs dictionary should not have counted duplicate words) is not true.

All in all, I think the freqs dictionary should have been built the way it is now - counting in the duplicate words.

But the feature extraction should probably have been consistent with the lecture video - your solution - not counting in duplicate words.

Anyways, it's Friday, the holidays are coming, my head is spinning :smiley: but I think you are right :+1:

p.s. maybe @Elemento would correct me on this?

Hey @arvyzukai and @Yoones_Vaezi1,

You both are correct regarding this. Let me create an issue, either to add a small note to the assignment about this, or perhaps to change the way extract_features is implemented, so that it is consistent with the lecture video.

Cheers,
Elemento
