Count words for positive and negative frequencies

I am very confused with the positive and negative frequencies in week 1.

First, what do "corpus" and "tweet" mean here? Is a training sample represented by the entire corpus, or by an individual sentence in the corpus? For the particular corpus example given in the course, why would someone say contradictory things at the same time, like "I am happy because I am learning NLP" and then "I am sad …"? That does not sound like a realistic example.

My second question: when creating the vector of dimension 3, why are the words "happy" and "because" not counted towards the positive frequency?

Hi Peixi_Zhu,

The ‘corpus’ consists of the set of tweets. In the very limited example used, this includes the following tweets: ‘I am happy because I am learning NLP’, ‘I am happy’, ‘I am sad, I am not learning NLP’, and ‘I am sad’. There is no discussion of training examples at this point. The discussion solely serves to demonstrate how sentiment features can be extracted from text. It is not meant to be a realistic example, but someone may well tweet ‘I am sad’ while not learning NLP and ‘I am happy because I am learning NLP’ when learning NLP.
The words ‘happy’ and ‘because’ are not part of the tweet ‘I am sad, I am not learning NLP’. They are therefore not taken into account when calculating the sentiment features of the tweet ‘I am sad, I am not learning NLP’.
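To make the counting concrete, here is a minimal sketch (not the course's actual code) of how the positive/negative frequency table can be built from the four example tweets. The `freqs` dictionary keyed by `(word, label)` pairs follows the convention used in the course.

```python
from collections import defaultdict

# The four tweets from the lecture, with sentiment labels
# (1 = positive, 0 = negative).
tweets = [
    ("I am happy because I am learning NLP", 1),
    ("I am happy", 1),
    ("I am sad, I am not learning NLP", 0),
    ("I am sad", 0),
]

# freqs maps (word, label) -> number of occurrences of `word`
# across all tweets with that label.
freqs = defaultdict(int)
for text, label in tweets:
    for word in text.replace(",", "").split():
        freqs[(word, label)] += 1

print(freqs[("happy", 1)])  # 2: 'happy' appears once in each positive tweet
print(freqs[("am", 1)])     # 3: twice in the first positive tweet, once in the second
print(freqs[("happy", 0)])  # 0: 'happy' never appears in a negative tweet
```

Because `freqs[("happy", 0)]` and `freqs[("because", 0)]` are both 0, those words contribute nothing when summing negative frequencies, and since they do not occur in the tweet "I am sad, I am not learning NLP" at all, they are not summed for that tweet's features either.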
I hope this clarifies.

Thanks, reinoudbosch

In this slide, since we are counting frequencies in the positive tweets, why are we looking at the negative tweet "I am sad, I am not learning NLP"? And if we do want to look at the negative tweet, why does the word "am" have a frequency of 3, not 2?

Hi Peixi_Zhu,

For the purpose of this example, it is assumed the model does not know whether it is looking at a positive or a negative tweet. In order to determine whether the tweet is positive or negative, it needs to count how many times the words in that tweet appear in positive tweets and how many times they appear in negative tweets. In this case, the tweet is "I am sad, I am not learning NLP". The model first counts how often the words in the tweet appear in positive tweets (the 'positive' feature). For 'am' this is 3. At 2:00 in the video, the model starts counting the number of times the words in the tweet appear in negative tweets.
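Here is a small sketch of that feature extraction, using frequency counts taken from the lecture's four-tweet corpus (the table below is an assumption based on those counts, not code from the course):

```python
# Assumed frequency table from the four-tweet corpus:
# (word, 1) -> count in positive tweets, (word, 0) -> count in negative tweets.
freqs = {
    ("I", 1): 3, ("am", 1): 3, ("happy", 1): 2, ("because", 1): 1,
    ("learning", 1): 1, ("NLP", 1): 1,
    ("I", 0): 3, ("am", 0): 3, ("sad", 0): 2, ("not", 0): 1,
    ("learning", 0): 1, ("NLP", 0): 1,
}

def extract_features(tweet):
    # One bias term, plus the summed positive and negative counts
    # of the *unique* words in the tweet.
    words = set(tweet.replace(",", "").split())
    pos = sum(freqs.get((w, 1), 0) for w in words)
    neg = sum(freqs.get((w, 0), 0) for w in words)
    return [1, pos, neg]

print(extract_features("I am sad, I am not learning NLP"))  # [1, 8, 11]
```

The positive sum includes `freqs[("am", 1)] == 3`, which is why 'am' contributes 3 (not 2) even though the tweet being scored is a negative one: the count comes from how often 'am' appears in the positive tweets.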
It is true that the example can be confusing, because the model's frequency counts were themselves computed using the negative tweet "I am sad, I am not learning NLP", so that particular tweet has already been fed into the model as a negative example. In the feature extraction example, this is ignored and the tweet is treated as if it were new to the model. It would have been clearer had a completely different tweet been used to illustrate the ('positive', 'negative') feature extraction.