Count words for positive and negative frequencies

I am very confused with the positive and negative frequencies in week 1.

First, what do "corpus" and "tweet" mean here? Is a training sample represented by the entire corpus, or by an individual sentence in the corpus? For the particular corpus example given in the course, why would someone say contradictory things at the same time, like "I am happy because I am learning NLP" and then "I am sad …"? That does not sound like a realistic example.

My second question: when creating the vector of dimension 3, why are the words "happy" and "because" not counted towards the positive frequency?

Hi Peixi_Zhu,

The ‘corpus’ consists of the set of tweets. In the very limited example used, this includes the following tweets: ‘I am happy because I am learning NLP’, ‘I am happy’, ‘I am sad, I am not learning NLP’, and ‘I am sad’. There is no discussion of training examples at this point. The discussion solely serves to demonstrate how sentiment features can be extracted from text. It is not meant to be a realistic example, but someone may well tweet ‘I am sad’ while not learning NLP and ‘I am happy because I am learning NLP’ when learning NLP.
The words ‘happy’ and ‘because’ are not part of the tweet ‘I am sad, I am not learning NLP’. They are therefore not taken into account when calculating the sentiment features of the tweet ‘I am sad, I am not learning NLP’.
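To make the counting concrete, here is a minimal sketch (not the course's actual code) of how the positive/negative frequency table can be built from the four example tweets. The `freqs` dictionary keyed by `(word, label)` pairs follows the convention used in the course.

```python
from collections import defaultdict

# The four tweets from the lecture, with sentiment labels
# (1 = positive, 0 = negative).
tweets = [
    ("I am happy because I am learning NLP", 1),
    ("I am happy", 1),
    ("I am sad, I am not learning NLP", 0),
    ("I am sad", 0),
]

# freqs maps (word, label) -> number of occurrences of `word`
# across all tweets with that label.
freqs = defaultdict(int)
for text, label in tweets:
    for word in text.replace(",", "").split():
        freqs[(word, label)] += 1

print(freqs[("happy", 1)])  # 2: 'happy' appears once in each positive tweet
print(freqs[("am", 1)])     # 3: twice in the first positive tweet, once in the second
print(freqs[("happy", 0)])  # 0: 'happy' never appears in a negative tweet
```

Because `freqs[("happy", 0)]` and `freqs[("because", 0)]` are both 0, those words contribute nothing when summing negative frequencies, and since they do not occur in the tweet "I am sad, I am not learning NLP" at all, they are not summed for that tweet's features either.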
I hope this clarifies.

Thanks, reinoudbosch

In this slide, since we are counting frequencies in the positive tweets, why are we looking at the negative tweet "I am sad, I am not learning NLP"? And if we do want to look at the negative tweet, why does the word "am" have a frequency of 3, not 2?

Hi Peixi_Zhu,

For the purpose of this example, it is assumed the model does not know whether it is looking at a positive or a negative tweet. In order to determine whether the tweet is positive or negative, it needs to count how many times the words in that tweet appear in positive tweets and how many times they appear in negative tweets. In this case, the tweet is "I am sad, I am not learning NLP". The model first counts how often the words in the tweet appear in positive tweets (the 'positive' feature). For 'am' this is 3. At 2:00 in the video, the model starts counting the number of times the words in the tweet appear in negative tweets.
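Here is a small sketch of that feature extraction, using frequency counts taken from the lecture's four-tweet corpus (the table below is an assumption based on those counts, not code from the course):

```python
# Assumed frequency table from the four-tweet corpus:
# (word, 1) -> count in positive tweets, (word, 0) -> count in negative tweets.
freqs = {
    ("I", 1): 3, ("am", 1): 3, ("happy", 1): 2, ("because", 1): 1,
    ("learning", 1): 1, ("NLP", 1): 1,
    ("I", 0): 3, ("am", 0): 3, ("sad", 0): 2, ("not", 0): 1,
    ("learning", 0): 1, ("NLP", 0): 1,
}

def extract_features(tweet):
    # One bias term, plus the summed positive and negative counts
    # of the *unique* words in the tweet.
    words = set(tweet.replace(",", "").split())
    pos = sum(freqs.get((w, 1), 0) for w in words)
    neg = sum(freqs.get((w, 0), 0) for w in words)
    return [1, pos, neg]

print(extract_features("I am sad, I am not learning NLP"))  # [1, 8, 11]
```

The positive sum includes `freqs[("am", 1)] == 3`, which is why 'am' contributes 3 (not 2) even though the tweet being scored is a negative one: the count comes from how often 'am' appears in the positive tweets.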
It is true that the example can be confusing, because the model's frequency counts were themselves computed using the negative tweet "I am sad, I am not learning NLP", so that particular tweet has already been fed into the model as a negative example. In the feature extraction example, this is ignored and the tweet is treated as if it were new to the model. It would have been clearer had a completely different tweet been used to illustrate the ('positive', 'negative') feature extraction.