Confusion in Logistic Regression Overview


In this image, a tweet is considered and a matrix of [bias, +ve freq, -ve freq] is created.
Looking at the tweet, it seems we can’t have 3476 as the +ve frequency and 245 as the -ve frequency, because each word appears only once.

Can someone explain this?

Thanks

Hi @Muhammad_Bilal_Hanee

I’m not sure I understand - why can’t we have 3476 and 245? Four numbers (four words) can easily add up to 3476 and 245, for example, 2000 + 1000 + 401 + 75 (positive counts) and 100 + 101 + 40 + 4 (negative counts).

Cheers

The number represents the total count of that word in the positive or negative corpus. It shows how positive or negative the word is, i.e. how often it is used in a positive or negative context.
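To make the counting concrete, here is a minimal sketch of how such a frequency table could be built from a labelled corpus. The function name build_freqs and the three toy tweets are made up for illustration and are not taken from the course material.

```python
from collections import Counter

def build_freqs(tweets, labels):
    """Count how often each (word, label) pair occurs across the whole corpus."""
    freqs = Counter()
    for tokens, label in zip(tweets, labels):
        for word in tokens:
            freqs[(word, label)] += 1
    return freqs

# Toy corpus of already-processed tweets (invented); 1 = positive, 0 = negative.
tweets = [["great", "model"], ["great", "ai"], ["bad", "model"]]
labels = [1, 1, 0]

freqs = build_freqs(tweets, labels)
print(freqs[("great", 1)])  # 2: "great" occurs in two positive tweets
print(freqs[("model", 0)])  # 1: "model" occurs in one negative tweet
```

The counts grow with the size of the corpus, not with the length of any single tweet.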

But in an example in the section:

On the left, we have a dictionary of unique words, and on the right, we have a tweet. We can see that “I” is seen 3 times and added to the frequency, just like “am” and so on.
Now look at the processed tweet [tun, ai, great, model]: each word is unique, so each should be added only once to the frequency table for positive tweets, making a sum of 4. Why is it 3476?

It should not add to 4, it should add to 3476 (as in the example I gave you). Imagine that you have 10000 tweets, of which 5000 are positive and 5000 are negative. Then imagine that:

  • the word “tun” appeared 2000 times in positive tweets and 100 times in negative tweets,
  • the word “ai” appeared 1000 times in positive tweets and 101 times in negative tweets,
  • “great” appeared 401 times in positive tweets and 40 times in negative tweets,
  • “model” appeared 75 times in positive tweets and 4 times in negative tweets.

That imaginary scenario would produce the vector in the picture you were asking about. In other words, you would encode the whole sentence in blue as the vector [1, 3476, 245] and make the prediction based on this vector.
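As a rough sketch, the same arithmetic in Python, using the imaginary counts listed above. The dictionary layout and the name extract_features are assumptions made for this illustration, not necessarily how the course code is organised.

```python
# Imaginary corpus counts from the scenario above:
# word -> (count in positive tweets, count in negative tweets)
freqs = {
    "tun": (2000, 100),
    "ai": (1000, 101),
    "great": (401, 40),
    "model": (75, 4),
}

def extract_features(tokens, freqs):
    """Encode a processed tweet as [bias, sum of positive counts, sum of negative counts]."""
    pos = sum(freqs.get(word, (0, 0))[0] for word in tokens)
    neg = sum(freqs.get(word, (0, 0))[1] for word in tokens)
    return [1, pos, neg]

features = extract_features(["tun", "ai", "great", "model"], freqs)
print(features)  # [1, 3476, 245]
```

Each word in the tweet contributes its corpus-wide count, which is why the sum is 3476 rather than 4.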

Cheers

Thanks, I was confusing the vocabulary with the corpus. I was thinking that since I have four unique words, looking them up in the vocabulary would give each a count of one, and I overlooked that the frequency table has to be built from the whole corpus.