Assignment inconsistent with course video: Frequencies for unique words or not?

Yoones_Vaezi1 · April 7, 2023, 6:56am

Hello,

In the course video on calculating feature vectors for each tweet, the second feature is described as the “sum of the positive frequencies for every unique word on tweet m”. The example in the course video, shown below, does exactly that.
course_video

However, in the assignment, calculating the features for only unique words fails some of the tests and does not provide expected results. This is what I expected to work:

expected

But using the np.unique method to get the unique tokens of a processed tweet for which to calculate features will not work and it will fail the tests. However, if I remove the np.unique() method it will work. This does not follow the course video though because we will be taking into account duplicated tokens when calculating features.
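To make the difference concrete, here is a minimal sketch of the two variants. The function name, the toy freqs values, and the token list are made up for illustration and are not the assignment's code; the de-duplication step stands in for np.unique.

```python
def extract_features_sketch(tokens, freqs, unique_only=False):
    """Return [bias, positive_sum, negative_sum] for one tokenized tweet."""
    if unique_only:
        # Order-preserving de-duplication; plays the role of np.unique here.
        tokens = list(dict.fromkeys(tokens))
    # Sum the positive-class and negative-class counts over the tokens.
    pos = sum(freqs.get((word, 1.0), 0) for word in tokens)
    neg = sum(freqs.get((word, 0.0), 0) for word in tokens)
    return [1.0, float(pos), float(neg)]


# Hypothetical frequency dictionary keyed by (word, label).
toy_freqs = {("great", 1.0): 5, ("great", 0.0): 1, ("movi", 1.0): 2}
tokens = ["great", "great", "movi"]

with_duplicates = extract_features_sketch(tokens, toy_freqs)
unique_words = extract_features_sketch(tokens, toy_freqs, unique_only=True)
print(with_duplicates)  # [1.0, 12.0, 2.0]
print(unique_words)     # [1.0, 7.0, 1.0]
```

With duplicates kept, the repeated "great" contributes its counts twice, so the two feature vectors differ whenever a tweet repeats a token.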

Thank you for your help!

arvyzukai · April 7, 2023, 7:18am

Hi @Yoones_Vaezi1

Very good observation

The assignment is consistent in weighting duplicate words:

great -> 0.516065
great great -> 0.532096
great great great -> 0.548062
great great great great -> 0.563929
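A rough sketch of why the values keep rising: if duplicates are counted, each extra "great" adds the word's counts to the features again, and the sigmoid of the weighted sum grows. The weights and counts below are hypothetical, chosen only for illustration, so the printed probabilities will not match the numbers above.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical values, not taken from the assignment.
theta = [0.0, 0.0005, -0.0005]   # bias, positive weight, negative weight
pos_count, neg_count = 100, 10   # assumed freqs of "great" per class

probs = []
for repeats in range(1, 5):
    # Counting duplicates scales both frequency features by the repeat count.
    x = [1.0, repeats * pos_count, repeats * neg_count]
    z = sum(t * xi for t, xi in zip(theta, x))
    probs.append(sigmoid(z))

print(probs)
```

If the features were computed over unique words only, x would be the same for every repeat count and the prediction would stay constant.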

The lecture video talks about feature extraction - building the freqs dictionary (which should not have counted duplicate words)

Yoones_Vaezi1 · April 7, 2023, 7:26am

Yes, I saw how counting duplicates changes the predicted probability for different numbers of the word “great”, and it makes sense. It just conflicts with what the course videos present, which consider duplicates only when building the frequency dictionary and not when building the feature vector for each tweet.

arvyzukai · April 7, 2023, 7:44am

I think I agree with you
The point should have been stated more explicitly in the assignment - that the prediction of being positive or negative depends on feature vector which counts-in duplicate words - which is different from the lecture videos.

Also, the freqs dictionary was not built using only unique words (at least judging from the code snippet in the notebook):

for y, tweet in zip(ys, tweets):
    for word in process_tweet(tweet):
        pair = (word, y)
        if pair in freqs:
            freqs[pair] += 1
        else:
            freqs[pair] = 1
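A self-contained sketch of the same loop shows the duplicate counting directly. The function name is made up, and a plain whitespace split stands in for process_tweet, which does more in the notebook.

```python
def build_freqs_sketch(tweets, ys, tokenize=str.split):
    """Count (word, label) pairs; repeated words in a tweet count every time."""
    freqs = {}
    for y, tweet in zip(ys, tweets):
        for word in tokenize(tweet):
            pair = (word, y)
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs


freqs = build_freqs_sketch(["great great movie", "bad movie"], [1.0, 0.0])
print(freqs[("great", 1.0)])  # 2
```

The word "great" appears twice in a single positive tweet and ends up with a count of 2, so no de-duplication happens at this stage.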

So my previous point (that the freqs dictionary should not have counted duplicate words) is not true.

All in all I think the freqs dictionary should have been built the way it is now - counting-in the duplicate words.

But the feature extraction should have probably been consistent with the lecture video - your solution - not counting-in duplicate words.

Anyway, it’s Friday, holidays are coming, my head is spinning, but I think you are right

p.s. maybe @Elemento would correct me on this?

Elemento · April 7, 2023, 8:53am

Hey @arvyzukai and @Yoones_Vaezi1,

You both are correct regarding this. Let me create an issue, either to add a small note to the assignment regarding this, or perhaps change the way extract_features should be implemented, so that it is consistent with the lecture video.

Cheers,
Elemento
