Logprior is counterproductive if training set doesn't match real world

The logprior we use when classifying a tweet encodes what we know about the class balance of the data we trained on. If the training set contains more negative tweets than positive ones, we start from the assumption that any new tweet is more likely to be negative.
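
To make the mechanics concrete, here is a minimal sketch of the scoring rule as I understand it from the assignment; the function name and the toy loglikelihood values are mine, purely for illustration:

```python
import math

def naive_bayes_score(tweet_words, logprior, loglikelihood):
    # Start every tweet at the logprior, then add the per-word loglikelihoods.
    # score > 0 -> classify as positive, otherwise negative.
    score = logprior
    for word in tweet_words:
        score += loglikelihood.get(word, 0.0)  # unseen words contribute nothing
    return score

# With 3x more negative than positive training tweets, logprior = log(D_pos / D_neg)
# puts every tweet at log(1/3) ≈ -1.1 before a single word is considered.
logprior = math.log(1 / 3)
print(naive_bayes_score(["happy", "day"], logprior, {"happy": 0.9, "day": 0.1}))
# ≈ -0.10: mildly positive words, yet the tweet is still classified negative
```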

My point is simply that training on data whose class ratios are not representative of reality will hurt our predictive ability with this particular algorithm.

If our training set is predominantly negative but reality is predominantly positive, then every tweet we classify starts with the built-in assumption that it is more likely to be negative.

Suppose our training set contains a word that occurs at the same rate in both classes, but the set itself is predominantly negative: 2000 tweets, 75% negative (1500 tweets) and 25% positive (500 tweets). The word occurs in 150 positive tweets and 450 negative tweets, i.e. in 30% of the positive tweets and 30% of the negative tweets.
The logprior of this training set is log(500/1500) ≈ -1.1, and the loglikelihood I get for this word is -0.69.
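
For a quick sanity check in code (only the logprior is re-derived here; the loglikelihood is the figure I quoted above):

```python
import math

n_pos_tweets = 500    # 25% of 2000
n_neg_tweets = 1500   # 75% of 2000

# logprior = log(D_pos / D_neg)
logprior = math.log(n_pos_tweets / n_neg_tweets)
print(round(logprior, 2))            # -1.1

loglikelihood_word = -0.69           # the value I computed for this word

# Score for a tweet whose only known word is this one;
# score > 0 would mean positive, so this tweet comes out clearly negative.
score = logprior + loglikelihood_word
print(round(score, 2))               # -1.79
```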

In this case, the logprior isn’t undoing the negative bias of the training set; it’s exacerbating it. Our word occurs at the same rate in both classes, yet the algorithm predicts that a tweet containing it is very likely negative.

This seems counterintuitive?

Hi @Nabil_Fairbairn

I don’t think so. On the contrary, I think it is quite obvious.

The most important thing is the dataset. If the dataset is not representative of the real world, then the logprior is not the only problem you have. No model will perform well if you train it on unrepresentative data.