# Bayes' rule and Naive Bayes

The lecture introduced both Bayes’ rule and Naive Bayes.
Bayes’ rule is P(X|Y)=P(Y|X)\frac{P(X)}{P(Y)}

And the Naive Bayes conditional probability ratio for binary classification is \prod_i{\frac{P(w_i|pos)}{P(w_i|neg)}}

I don’t understand how these two are related to each other. How is each component of Bayes’ rule mapped to Naive Bayes?
Could someone explain? Thanks.

Hi @mc04xkf

In simple words:

• the first one gives you the values for P(w_i|pos) and P(w_i|neg),
• the second one combines them (for a sequence);

Btw, it’s called Naive because we treat the ratios as independent and just multiply them as if they were coin flips. In language, words do depend on each other (they are not coin flips), but for some simple applications we can get away with it.
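To make the "just multiply them" step concrete, here is a minimal sketch. The word probabilities are made-up values for illustration, and the function names are hypothetical; it also ignores smoothing and unseen words.

```python
import math

# Hypothetical per-word likelihoods (illustrative values, not from the course data).
p_word_given_pos = {"happy": 0.20, "sad": 0.02, "today": 0.10}
p_word_given_neg = {"happy": 0.05, "sad": 0.20, "today": 0.10}

def likelihood_ratio(words):
    """Multiply the per-word ratios P(w_i|pos) / P(w_i|neg), treating words as independent."""
    ratio = 1.0
    for w in words:
        ratio *= p_word_given_pos[w] / p_word_given_neg[w]
    return ratio

def log_likelihood_ratio(words):
    """Same quantity in log space: a sum of log ratios instead of a product."""
    return sum(math.log(p_word_given_pos[w]) - math.log(p_word_given_neg[w]) for w in words)

tweet = ["happy", "today"]
ratio = likelihood_ratio(tweet)          # greater than 1 leans positive
log_ratio = log_likelihood_ratio(tweet)  # greater than 0 leans positive
print(ratio, log_ratio)
```

The log version is the usual choice for longer texts, because a product of many small numbers can underflow.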

Cheers

To get P(w_i|pos) and P(w_i|neg), we don’t need Bayes’ rule, do we?
It is simply the number of times a word appears in pos/neg tweets, divided by the total number of pos/neg tweets.

@mc04xkf Bayes’ rule helps us transition between conditional probabilities:

• P(pos|w_i) \leftrightarrow P(w_i|pos) and P(neg|w_i) \leftrightarrow P(w_i|neg).

If we had the table of “counts” for the whole world, we wouldn’t need Bayes’ rule. But usually that is not the case, and we only have particular conditional probabilities. Then Bayes’ rule helps us find the “opposite” one. For example, in our case we know P(w_i|pos) (we interpret the data as word “i” having occurred in a positive tweet, i.e. the probability of word “i” given that the tweet is positive), and we want to get to P(pos|w_i); that is why we do all the calculations.

We could have spared ourselves the trouble and just calculated P(pos|w_i) by dividing the positive count of word i by the overall count of word i. But this would only be accurate if we had all the tweet data in the world, or if we interpreted the data as the positive tweet having happened because of word i (neither of which is the case).
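As a rough illustration of the two routes, the sketch below computes P(pos|w_i) both by direct counting and through Bayes’ rule on one small, made-up count table. All counts are hypothetical, and the prior is taken at the word-token level, which is an assumption of this sketch rather than something stated in the thread.

```python
from fractions import Fraction

# Hypothetical token counts per word and class (made up for illustration).
counts = {
    "happy": {"pos": 3, "neg": 1},
    "sad":   {"pos": 1, "neg": 3},
    "today": {"pos": 2, "neg": 2},
}
n_pos = sum(c["pos"] for c in counts.values())  # tokens seen in positive tweets
n_neg = sum(c["neg"] for c in counts.values())  # tokens seen in negative tweets
n_all = n_pos + n_neg

def direct_posterior(word):
    """P(pos|w) by counting: positive occurrences over all occurrences of the word."""
    c = counts[word]
    return Fraction(c["pos"], c["pos"] + c["neg"])

def bayes_posterior(word):
    """P(pos|w) = P(w|pos) * P(pos) / P(w), each term estimated from the same table."""
    c = counts[word]
    likelihood = Fraction(c["pos"], n_pos)
    prior = Fraction(n_pos, n_all)
    evidence = Fraction(c["pos"] + c["neg"], n_all)
    return likelihood * prior / evidence

for word in counts:
    print(word, direct_posterior(word), bayes_posterior(word))
```

On a single table the two numbers coincide. The Bayes’ rule form becomes useful when the prior and the likelihood are estimated or adjusted separately.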

Sorry but I still don’t get it.

1. You said

we want to get to P(pos|w_i)

but we actually want P(w_i|pos) and P(w_i|neg) as in \prod_i{\frac{P(w_i|pos)}{P(w_i|neg)}}

2. You said

this would be accurate if we had all the tweet data in the world

In supervised learning, the “world” is the training data, i.e. we only collect the statistics/probabilities from the training data, so it is easy/possible to get P(pos|w_i)

@mc04xkf

No problem, let me explain in more detail. From Bayes’ rule we know:

P(pos|w_i)=\frac{P(w_i|pos)*P(pos)}{P(w_i)};

P(neg|w_i)=\frac{P(w_i|neg)*P(neg)}{P(w_i)};

What would the fraction \frac{P(pos|w_i)}{P(neg|w_i)} be equal to?

I would suggest deriving it on your own on paper, since that way you will remember it for longer, but in any case, here is the solution:

\frac{P(pos|w_i)}{P(neg|w_i)}=\frac{\frac{P(w_i|pos)*P(pos)}{P(w_i)}}{\frac{P(w_i|neg)*P(neg)}{P(w_i)}}=\frac{P(w_i|pos)*P(pos)*P(w_i)}{P(w_i|neg)*P(neg)*P(w_i)}= \frac{P(w_i|pos)*P(pos)*\bcancel{P(w_i)}}{P(w_i|neg)*P(neg)*\bcancel{P(w_i)}}=\\=\frac{P(pos)}{P(neg)}*\frac{P(w_i|pos)}{P(w_i|neg)};
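The identity can also be checked numerically. The following is a small sketch with made-up probabilities (none of these values come from the course), using exact fractions so the comparison is not affected by rounding.

```python
from fractions import Fraction

# Hypothetical inputs (illustrative values only).
p_pos = Fraction(2, 5)
p_neg = 1 - p_pos
p_w_given_pos = Fraction(3, 10)
p_w_given_neg = Fraction(1, 10)

# P(w_i) by the law of total probability.
p_w = p_w_given_pos * p_pos + p_w_given_neg * p_neg

# Bayes' rule for each class.
p_pos_given_w = p_w_given_pos * p_pos / p_w
p_neg_given_w = p_w_given_neg * p_neg / p_w

posterior_odds = p_pos_given_w / p_neg_given_w
prior_odds = p_pos / p_neg
likelihood_ratio = p_w_given_pos / p_w_given_neg

# P(w_i) cancels, so posterior odds = prior odds * likelihood ratio.
print(posterior_odds, prior_odds * likelihood_ratio)
```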

I would suggest reading more about Bayesian thinking and the difference between Bayesian and frequentist (or classical… on the other hand, nowadays everything that is not Deep Learning is classical) statistics.

I wish I could provide some links but I think you are better at finding them yourself (I primarily learned about it a while ago from an excellent book by Allen B. Downey, “Think Bayes” (now there’s a second edition)).

It’s hard to explain concisely, but if I tried:

• Frequentist statistics views probability as the long-run frequency of an outcome if you repeat the experiment infinitely many times. (Frequentists hold the parameters fixed and model the data.) For example, what is the probability that this tweet contains these words if it is positive? In other words, like in a coin flip, the probability of being positive is fixed, but the outcome is generated.
• Bayesians view probability as a measure of belief that an event will happen. (They hold the data fixed and model the parameters.) For example, we have some prior belief that this tweet is positive (let’s say 0.5). Then what is the probability that this tweet is positive given that it contains this word? (Let’s say it changes to 0.55.) Then again, what is the probability that this tweet is positive given another word? And so on. In other words, we know the outcome and we try to measure the probability (we know the coin flip results, but how probable is it that the coin’s bias is 0.5?).
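The word-by-word belief update in the Bayesian bullet can be sketched as follows. The likelihood values are hypothetical and chosen only so that the first update moves the belief from 0.5 to 0.55, as in the example above; `update` is an illustrative helper, not something from the course.

```python
from fractions import Fraction

def update(prior_pos, p_word_given_pos, p_word_given_neg):
    """One Bayes' rule step: P(pos | word), starting from the current belief P(pos)."""
    joint_pos = prior_pos * p_word_given_pos
    joint_neg = (1 - prior_pos) * p_word_given_neg
    return joint_pos / (joint_pos + joint_neg)

belief = Fraction(1, 2)  # prior belief that the tweet is positive

# First word: hypothetical likelihoods with ratio 11/9.
belief_after_first = update(belief, Fraction(11, 1000), Fraction(9, 1000))

# Second word: hypothetical likelihoods with ratio 2. Reusing the previous
# posterior as the new prior is where the naive independence assumption enters.
belief_after_second = update(belief_after_first, Fraction(2, 100), Fraction(1, 100))

print(float(belief_after_first), float(belief_after_second))
```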

In our case we think that we know the P(w_i|pos) but not the P(pos|w_i) and we are trying to get to that.

Thanks @arvyzukai

I found the answer in the scikit-learn documentation, section “1.9. Naive Bayes” (scikit-learn 1.2.2).

The derivation is captured below:

P[ L=0 | X=x ] = P[ ( X=x ) ∩ ( L=0 ) ] / P[ X=x ]
= ( P[ L=0 ] * P[ X=x | L=0 ] ) / P[ X=x ]
= ( P[ L=0 ] * P[ X=x | L=0 ] ) / ( P[ ( X=x ) ∩ ( L=0 ) ] + P[ ( X=x ) ∩ ( L=1 ) ] )
= ( P[ L=0 ] * P[ X=x | L=0 ] ) / ( ( P[ L=0 ] * P[ X=x | L=0 ] ) + ( P[ L=1 ] * P[ X=x | L=1 ] ) )
= 1 / ( 1 + ( ( P[ L=1 ] * P[ X=x | L=1 ] ) / ( P[ L=0 ] * P[ X=x | L=0 ] ) ) )
= 1 / ( 1 + ( ( P[ L=1 ] / P[ L=0 ] ) * ( P[ X=x | L=1 ] / P[ X=x | L=0 ] ) ) ) ← Here is what you’re looking for
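As a sanity check on the last line of the derivation, this small sketch compares the directly computed P[ L=0 | X=x ] with the 1 / ( 1 + ratio ) form. The probabilities are made-up values for illustration.

```python
from fractions import Fraction

# Hypothetical priors and likelihoods (illustrative values only).
p_l0 = Fraction(3, 5)
p_l1 = 1 - p_l0
p_x_given_l0 = Fraction(1, 10)
p_x_given_l1 = Fraction(3, 10)

# Direct route: joint probability over the total probability of X=x.
joint_l0 = p_l0 * p_x_given_l0
joint_l1 = p_l1 * p_x_given_l1
posterior_direct = joint_l0 / (joint_l0 + joint_l1)

# Ratio route: prior ratio times likelihood ratio, then 1 / (1 + ratio).
ratio = (p_l1 / p_l0) * (p_x_given_l1 / p_x_given_l0)
posterior_from_ratio = 1 / (1 + ratio)

print(posterior_direct, posterior_from_ratio)
```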

Notes:

Given a Probability Space, ( Ω , F , P ) where
Ω: Sample Space
F: σ-Algebra
P: Probability Function

∙ P[ X=x ] = P[ ( X=x ) ∩ Ω ]
= P[ X=x ∩ ( ( L=0 ) ∪ ( L=1 ) ) ]
= P[ ( X=x ∩ L=0 ) ∪ ( X=x ∩ L=1 ) ]
= P[ ( X=x ∩ L=0 ) ] + P[ ( X=x ∩ L=1 ) ]

∙ Conditional Probability
P[ ( X=x ∩ L=0 ) ] = P[ L=0 ] * P[ X=x | L=0 ]

∙ Range of Probability Function
P[ L=0 | X=x ] ∈ [0,1]

∙ This expression is called Naive Bayes’ Inference in Younes’ notes.
( P[ L=1 ] / P[ L=0 ] ) * ( P[ X=x | L=1 ] / P[ X=x | L=0 ] ) ∈ [0,∞]

Personal note: I understand that Younes is in effect calculating P[ L=0 | X=x ], but since this probability is determined by Naive Bayes’ Inference, he just uses that expression to determine the label.
For example,

If Naive Bayes’ Inference → 0 then P[ L=0 | X=x ] → 1 , i.e., it is highly likely that the label is 0
If Naive Bayes’ Inference → ∞ then P[ L=0 | X=x ] → 0 , i.e., it is unlikely that the label is 0
If Naive Bayes’ Inference → 1 then P[ L=0 | X=x ] → 0.5 , i.e., the label is “undefined”

I guess he used Naive Bayes’ Inference just to compute fewer operations, given that it reaches the same objective: estimating the label.
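A quick sketch of why thresholding the ratio gives the same label as thresholding the posterior. The function names and the tie-breaking choice (label 0 when the ratio is exactly 1) are assumptions of this sketch.

```python
def posterior_label0(ratio):
    """P[L=0 | X=x] expressed through the Naive Bayes' Inference ratio."""
    return 1.0 / (1.0 + ratio)

def label_from_ratio(ratio):
    """Predict label 1 when the ratio exceeds 1, otherwise label 0."""
    return 1 if ratio > 1 else 0

def label_from_posterior(ratio):
    """Predict label 1 when P[L=0 | X=x] drops below 0.5, otherwise label 0."""
    return 1 if posterior_label0(ratio) < 0.5 else 0

# Ratios near 0, below 1, exactly 1, above 1, and very large.
for r in [0.001, 0.5, 1.0, 2.0, 1000.0]:
    print(r, posterior_label0(r), label_from_ratio(r), label_from_posterior(r))
```

Both rules agree on every ratio in the loop, so working with the ratio alone skips the extra division without changing the predicted label.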

I don’t quite understand how we got the equation that comes after

for all i, this relationship is simplified to