Part 2: Train Naive Bayes (failing check)

This is what I entered:

    ### START CODE HERE ###

    # calculate V, the number of unique words in the vocabulary
    
    simplified = [] 
    for tweet in train_x:
        simplified.extend(process_tweet(tweet))
    vocab = list(set(simplified))
    V = len(vocab)
    # print("size of vocab is", V)

    # calculate N_pos, N_neg, V_pos, V_neg
    N_pos, N_neg = 0, 0
    for pair in freqs.keys():
        # if the label is positive (greater than zero)
        if pair[1] > 0:
            
            # Increment the number of positive words by the count for this (word, label) pair
            N_pos += freqs[pair]

        # else, the label is negative
        else:
            
            # increment the number of negative words by the count for this (word,label) pair
            N_neg += freqs[pair]
    
    # Calculate D, the number of documents
    D = len(train_y)

    # Calculate D_pos, the number of positive documents
    D_pos = np.count_nonzero(train_y == 1)

    # Calculate D_neg, the number of negative documents
    D_neg = D - D_pos

    # Calculate logprior
    logprior = np.log(D_pos) - np.log(D_neg)
    
    # For each word in the vocabulary...

    for word in vocab:
        # get the positive and negative frequency of the word
        try:
            freq_pos = freqs[(word, 1.0)]
            #print("freq_pos for", word, "is", freq_pos)
        except:
            freq_pos = 0
        try:
            freq_neg = freqs[(word, 0.0)]
            #print("freq_neg for", word, "is", freq_neg)
        except:
            freq_neg = 0

        # calculate the probability that each word is positive, and negative
        p_w_pos = (freq_pos + 1) / (N_pos + V)
        p_w_neg = (freq_neg + 1) / (N_neg + V)

        # calculate the log likelihood of the word
        loglikelihood[word] = np.log(p_w_pos / p_w_neg)

        # print("Word is", word)

    ### END CODE HERE ###

The output is identical to what’s expected:

0.0
9165

But when I run the next check

# Test your function
w2_unittest.test_train_naive_bayes(train_naive_bayes, freqs, train_x, train_y)

I get the following error:

Wrong number of keys in loglikelihood dictionary. 
	Expected: 9165.
	Got: 148.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-16-ca55b160668b> in <module>
      1 # Test your function
----> 2 w2_unittest.test_train_naive_bayes(train_naive_bayes, freqs, train_x, train_y)

~/work/w2_unittest.py in test_train_naive_bayes(target, freqs, train_x, train_y)
    369         for key, value in test_case["expected"]["loglikelihood"].items():
    370 
--> 371             if np.isclose(result2[key], value):
    372                 count_good += 1
    373 

KeyError: 'sunglass'

What am I missing?

Hi, Sapiens.

I think the problem with your code is that you build your vocab from the tweets (via `simplified`) rather than from the `freqs` dictionary, as the instructions suggest:

  • You can then compute the number of unique words that appear in the freqs dictionary to get your 𝑉 (you can use the set function)

My guess is that the unittest calls your function with inputs that differ from the notebook's, so a vocab built from the tweets in `train_x` no longer matches the one the test expects from `freqs` (hence 148 keys instead of 9165, and the KeyError on 'sunglass').
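For illustration, here is a hedged sketch of that change. The `freqs` below is made-up toy data (the real one comes from the assignment's `count_tweets` helper); only the pattern of pulling words out of the keys matters:

```python
# Toy freqs dictionary in the assignment's (word, label) -> count format.
# This small dict is just for illustration, not the real training data.
freqs = {
    ("happi", 1.0): 3, ("happi", 0.0): 1,
    ("sad", 0.0): 4,
    ("sunglass", 1.0): 2,
}

# Build the vocabulary from the keys of freqs, not from the tweets:
# take the word (first element) of every key and let set() deduplicate.
vocab = set(pair[0] for pair in freqs.keys())
V = len(vocab)

print(sorted(vocab))  # ['happi', 'sad', 'sunglass']
print(V)              # 3
```

Because the vocab is derived from whatever `freqs` the test passes in, it stays consistent with the expected loglikelihood keys.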


I'm having tremendous difficulty figuring out how to get vocab. I'm pretty sure I understand the idea, but I can't figure out how to make the code do it. I think I'm supposed to get all the words without the 0 or 1, right?
Like if we have:
('word1', 0), ('word2', 0)…etc.
Then my vocab would be:
'word1', 'word2'…etc.,
right? How do I get Python to do that?

Edit: After working on it more, I think my previous assumption was wrong. I really need help understanding the vocab thing. What is that supposed to be?

Edit Again: I figured out a workaround that is clearly not what was intended, but it works, so I'm past that part of the assignment now. However, I'd still like to know how I was meant to get vocab and V.

You are correct that you need to select the first element (the word) from each of the dictionary keys. Then you need to make them unique, since there are probably two entries for each word (one per label), right? They give you the hint of using the `set()` function in Python; check its documentation.

If you want to go "totally pythonic" here, you can feed the `set()` function a comprehension that extracts the first element of every key in the `freqs` dictionary. It's a single line of code.
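A hedged sketch of that single line, using made-up `freqs` entries purely for illustration:

```python
# Hypothetical freqs entries; keys are (word, label) pairs as in the assignment.
freqs = {("word1", 0.0): 5, ("word1", 1.0): 2, ("word2", 0.0): 3}

# The single line: a set comprehension over the first element of each key.
vocab = {word for word, label in freqs.keys()}
V = len(vocab)

print(V)  # 2
```

Duplicate words (one entry per label) collapse automatically, since a set keeps only unique elements.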
