I want to understand how the prior ratio P(positive)/P(negative), when multiplied by the product of the likelihood ratios P(word|positive)/P(word|negative) over all the words in an example, helps when we have an unbalanced dataset.

Hi, mohit.

Imagine the case where you have an empty tweet - how would you classify it? If, for example, 80% of the tweets in your dataset were negative, would it be more reasonable to predict that the empty tweet is negative or positive? (Remember that your model only knows the world that you provided to it, not the reasoning you might have as a human.)

Extending this example would be application specific, but the idea is the same: the prior sets a default level of negativity, so the question becomes how “positive” the words of a tweet have to be to overcome that default.
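
Here is a rough sketch of that idea in code, in case it helps. It works in log space, so the product of ratios becomes a sum. The 20%/80% class split matches the example above, but the per-word ratios and the names `log_score` and `classify` are made up for illustration and do not come from any real dataset or library.

```python
import math

# Assumed class balance: 20% positive, 80% negative tweets.
log_prior_ratio = math.log(0.2 / 0.8)

# Hypothetical likelihood ratios P(word|positive) / P(word|negative).
log_likelihood_ratio = {
    "happy": math.log(3.0),
    "great": math.log(2.0),
    "sad": math.log(0.25),
}


def log_score(words):
    """Log prior ratio plus the summed log likelihood ratios of known words."""
    # Words missing from the table contribute 0, i.e. a neutral ratio of 1.
    return log_prior_ratio + sum(log_likelihood_ratio.get(w, 0.0) for w in words)


def classify(words):
    """Predict positive only when the evidence outweighs the prior."""
    return "positive" if log_score(words) > 0 else "negative"


for tweet in ([], ["happy"], ["happy", "great"]):
    print(tweet, round(log_score(tweet), 3), classify(tweet))
```

With these assumed numbers, the empty tweet has no word evidence, so its score is just the log prior and it is classified as negative. One mildly positive word is still not enough to cross zero, while two positive words together are. In a balanced dataset the log prior would be 0 and the words alone would decide.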