# Part 2: Train Naive Bayes (failing check)

This is what I entered:

``````
### START CODE HERE ###

# calculate V, the number of unique words in the vocabulary
simplified = []
for tweet in train_x:
    simplified.extend(process_tweet(tweet))
vocab = list(set(simplified))
V = len(vocab)
# print("size of vocab is", V)

# calculate N_pos, N_neg, V_pos, V_neg
N_pos, N_neg = 0, 0
for pair in freqs.keys():
    # if the label is positive (greater than zero)
    if pair[1] > 0:
        # increment the number of positive words by the count for this (word, label) pair
        N_pos += freqs[pair]
    # else, the label is negative
    else:
        # increment the number of negative words by the count for this (word, label) pair
        N_neg += freqs[pair]

# calculate D, the number of documents
D = len(train_y)

# calculate D_pos, the number of positive documents
D_pos = np.count_nonzero(train_y == 1)

# calculate D_neg, the number of negative documents
D_neg = D - D_pos

# calculate logprior
logprior = np.log(D_pos) - np.log(D_neg)

# for each word in the vocabulary...
for word in vocab:
    # get the positive and negative frequency of the word
    try:
        freq_pos = freqs[(word, 1.0)]
        # print("freq_pos for", word, "is", freq_pos)
    except KeyError:
        freq_pos = 0
    try:
        freq_neg = freqs[(word, 0.0)]
        # print("freq_neg for", word, "is", freq_neg)
    except KeyError:
        freq_neg = 0

    # calculate the probability that each word is positive, and negative
    p_w_pos = (freq_pos + 1) / (N_pos + V)
    p_w_neg = (freq_neg + 1) / (N_neg + V)

    # calculate the log likelihood of the word
    loglikelihood[word] = np.log(p_w_pos / p_w_neg)

### END CODE HERE ###
``````

The output is identical to what's expected:

``````0.0
9165
``````

But when I run the next check

``````# Test your function
w2_unittest.test_train_naive_bayes(train_naive_bayes, freqs, train_x, train_y)
``````

I get the following error:

``````Wrong number of keys in loglikelihood dictionary.
Expected: 9165.
Got: 148.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-16-ca55b160668b> in <module>
----> 2 w2_unittest.test_train_naive_bayes(train_naive_bayes, freqs, train_x, train_y)

~/work/w2_unittest.py in test_train_naive_bayes(target, freqs, train_x, train_y)
369         for key, value in test_case["expected"]["loglikelihood"].items():
370
--> 371             if np.isclose(result2[key], value):
372                 count_good += 1
373

KeyError: 'sunglass'
``````

What am I missing?

Hi, Sapiens.

I think the problem with your code is that you construct your `vocab` from the tweets (through `simplified`) rather than from the `freqs` dictionary, as suggested in the instructions:

• You can then compute the number of unique words that appear in the `freqs` dictionary to get your V (you can use the `set` function)

I am guessing the unit test calls your function with different inputs, so a vocabulary built from the tweets ends up containing a different set of words than one built from the `freqs` dictionary (hence the missing `'sunglass'` key).


Iâm having tremendous difficulty figuring out how to get vocab. Iâm pretty sure I understand the idea, but canât figure out how to make the code do it. I think Iâm supposed to get all the words without the 0 or 1, right?
Like if we have:
(âword1â, 0), (âword2â, 0)âŚetc.
Then my vocab would be:
âword1â, âword2ââŚetc.,
right? How do i get Python do to that?

Edit: After working on it more, I think my previous assumption was wrong. I really need help understanding the vocab thing. What is that supposed to be?

Edit Again: I figured out a workaround that is clearly not what was meant for me to do, but it works, so Iâm past that part of the assignment now. However, Iâd still like to know how it was intended that I get vocab and V.

You are correct that you need to select the first element (the word) from each of the dictionary keys. Then you need to make them unique, since there are probably two entries for each word (one per label), right? They give you the hint of using the `set()` function in Python. Check the documentation for that.

If you want to go "totally pythonic" here, you just feed the `set()` function the output of a "list comprehension" that enumerates the first element of all the keys in the `freqs` dictionary. It's just a single line of code.
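To make that concrete, here is a minimal sketch using a toy `freqs` dictionary (the contents are made up for illustration; in the assignment `freqs` maps `(word, label)` pairs to counts):

``````
# toy freqs dictionary: (word, label) -> count
freqs = {
    ("happi", 1.0): 3,
    ("happi", 0.0): 1,
    ("sad", 0.0): 2,
    ("sunglass", 1.0): 1,
}

# take the first element (the word) of every key;
# set() removes the duplicates that come from a word
# appearing under both labels
vocab = set(pair[0] for pair in freqs.keys())
V = len(vocab)

print(sorted(vocab))  # ['happi', 'sad', 'sunglass']
print(V)              # 3
``````

Note that "happi" appears under both labels but is counted only once in `vocab`.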
