Why is the log prior calculated like that?

The prior is meant to correct the probabilities in case there’s an imbalance in the dataset. I am going over the assignment, and I notice that the prior is calculated as: number of positive tweets / number of negative tweets

However, I do not think that is the right way to compute the prior.

Imagine a scenario where we have an equal number of positive and negative tweets, but the negative tweets tend to be shorter than the positive ones (this is often the case in real-world scenarios).
It does not matter that we have equal numbers of positive and negative tweets; what matters is that we do not have equal numbers of positive and negative words.

Therefore, I think that logprior should be calculated from: number of positive words / number of negative words.

Is there anything I am missing?

Hi @popaqy

The prior is used correctly - it is your prior assessment of how imbalanced the dataset is.

The word probabilities in each class are normalized by N_{class}, the number of words in that class (not the overall word count). So if there were fewer words in the negative class, what would change is each word’s weight for that category.
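To make the two roles explicit, here is a sketch of the formulas as I understand them. The symbols freq(w, class), D_{pos} and D_{neg} (tweet counts per class) are my own notation, and smoothing is left out for simplicity, so treat this as an illustration rather than the assignment's exact definition:

```latex
\[
P(w \mid \text{class}) = \frac{\mathrm{freq}(w, \text{class})}{N_{\text{class}}},
\qquad
\text{score}(\text{tweet}) =
\underbrace{\log\frac{D_{pos}}{D_{neg}}}_{\text{log prior}}
+ \sum_{w \in \text{tweet}} \log\frac{P(w \mid pos)}{P(w \mid neg)}
\]
```

The per-class word counts N_{class} already appear inside every likelihood term, so the prior only needs to describe how many tweets of each class there are.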

For example, if a word appears equally often in the positive and negative categories (say 10 times each), and there are fewer negative words overall (say 800, vs. 1000 in positive), then p(word|negative) would be greater than p(word|positive) (0.0125 > 0.01). Now imagine that the tweet is that single word. If all you know about the rest of the world is that beforehand you had 900 positive and 900 negative tweets, then your prior should be equally balanced and not rebalanced according to word counts.
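The numbers in this example can be checked with a few lines of Python. This is only a minimal sketch: the variable names are made up for illustration, no smoothing is applied, and it assumes the usual convention that a score below zero means "negative":

```python
import math

# Counts taken from the example above (variable names are illustrative).
word_freq_pos = 10      # times the word appears in positive tweets
word_freq_neg = 10      # times the word appears in negative tweets
n_words_pos = 1000      # total words in the positive class
n_words_neg = 800       # total words in the negative class
n_tweets_pos = 900      # number of positive tweets
n_tweets_neg = 900      # number of negative tweets

# Likelihoods are normalized by the word count of each class.
p_word_pos = word_freq_pos / n_words_pos
p_word_neg = word_freq_neg / n_words_neg

# The prior depends only on how many tweets each class has.
logprior = math.log(n_tweets_pos / n_tweets_neg)

# Log-likelihood ratio for a tweet consisting of that single word.
loglikelihood = math.log(p_word_pos / p_word_neg)

score = logprior + loglikelihood
print(p_word_pos, p_word_neg, logprior, score)
```

The log prior comes out as zero because the tweet counts are equal, while the shorter negative class still pulls the score below zero through the likelihood term. The difference in word counts is therefore already accounted for without touching the prior.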

P.S. If you decided to inject word-count information into the model, you would probably do the same thing, not with the exact words but with the number of words (tweet length) in each category, normalized the same way. Your prior would still be equally balanced.

Hi @popaqy,

I don’t remember the NLP material as well as I would have liked, but as you progress course by course, you’ll see that things get more advanced, and this might be taken care of by those later concepts.

Best,
Mubsi