The image below explains that sets created with set() support mathematical operations such as union, intersection, difference, and symmetric difference. Kindly see the image, which walks through an example.
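In case the image does not display, here is a minimal sketch of the same four operations on two small sets. The example words are made up for illustration.

```python
# Two small example sets (the words are made up for illustration).
a = {"happy", "smile", "cash"}
b = {"smile", "cash", "sad"}

union = a | b                  # every word that is in a or b
intersection = a & b           # words that are in both
difference = a - b             # words in a that are not in b
symmetric_difference = a ^ b   # words in exactly one of the two sets

print(sorted(union))                 # ['cash', 'happy', 'sad', 'smile']
print(sorted(intersection))          # ['cash', 'smile']
print(sorted(difference))            # ['happy']
print(sorted(symmetric_difference))  # ['happy', 'sad']
```

The operators have method equivalents (a.union(b), a.intersection(b), a.difference(b), a.symmetric_difference(b)) if you find those easier to read.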
Deepti has given the documentation, but the key point in this application here is that you want the unique words. The point is that some words have both a positive and negative entry in the freqs dictionary, so if you just take all the words from the keys of the dictionary, there will be duplicates. That’s what we are using “set()” for here. You could implement that with a loop, but it would be more complicated code and would be (I assume) quite a bit slower. Mind you, I have no idea how the “set()” function is actually implemented, so maybe I’m wrong about the efficiency part of the argument. Hmmmm, we could test it out. But using set() is clearly very simple for us to write. Less code is better, right?
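For anyone who wants to test the efficiency question, here is a rough sketch using timeit. The word list is randomly generated (hypothetical data, not the tweets from the assignment), and the loop shown is only one possible way to write the loop version.

```python
import random
import timeit

random.seed(0)
# Hypothetical data: 5,000 words drawn from a pool of 500 distinct tokens.
pool = [f"word{i}" for i in range(500)]
words = [random.choice(pool) for _ in range(5000)]

def unique_with_set(items):
    """Deduplicate by building a set."""
    return set(items)

def unique_with_loop(items):
    """Deduplicate with an explicit loop over a list."""
    unique = []
    for item in items:
        if item not in unique:   # scans the list each time
            unique.append(item)
    return unique

# Both approaches should find the same unique words.
assert unique_with_set(words) == set(unique_with_loop(words))

t_set = timeit.timeit(lambda: unique_with_set(words), number=5)
t_loop = timeit.timeit(lambda: unique_with_loop(words), number=5)
print(f"set():  {t_set:.4f} s for 5 runs")
print(f"loop:   {t_loop:.4f} s for 5 runs")
```

The exact timings depend on the machine, so none are quoted here. For background: in CPython a set is a hash table, so checking whether a word is already present takes roughly constant time on average, whereas the list-based loop above rescans the list for every word.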
Suppose I gave you a Python list of strings, each of which is a word. How would you write the code to make sure that the list does not contain any duplicate words?
Right, think about what the entries in freqs are: they are pairs of (key, value) in which the key is a tuple of (word, sentiment) and value is the count for that key.
What you want here is the unique words from the keys, right?
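A minimal sketch of that idea on a toy dictionary, assuming the sentiment labels are 1 for positive and 0 for negative (the thread does not state them) and using a made-up entry for "sad":

```python
# Toy freqs dictionary: keys are (word, sentiment) tuples, values are counts.
# The labels 1 (positive) and 0 (negative) are an assumption for this sketch.
freqs = {
    ("smile", 1): 47,
    ("smile", 0): 9,
    ("lawnmow", 1): 1,
    ("sad", 0): 5,      # made-up entry for illustration
}

# Taking the word from every key gives duplicates ("smile" appears twice) ...
all_words = [word for word, sentiment in freqs.keys()]

# ... and set() removes them.
vocab = set(all_words)
V = len(vocab)

print(len(all_words), V)   # 4 3
```

Calling len(set(freqs.keys())) directly would count the tuples rather than the words, which is why the word has to be pulled out of each key first.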
The whole basis for getting a set of unique words in the vocabulary fits in with what Paul has added to the explanation. Basically, set() creates a new set from the elements passed in the call: it takes an iterable as an argument and returns a new set object.
Okay, I think I see it now. Sometimes I still have a little trouble working with iterators in Python. I know that in some sense they are much easier and more flexible than a brute-force indexing approach, but thinking in terms of indexing just comes more naturally to me.
But yes, you are right. I was using len(set(…)), but the keys are tuples and we only want the set of words, regardless of the associated sentiment. I figured it out.
Note that I have some other instrumentation in addition to just that.
So my freqs dictionary has the words in a different order. I’m not sure offhand whether that is in itself a problem. The first question is whether your count_tweets function passed the tests and whether you ran everything in order.
The code you show is a little different than mine and you’re working a little too hard, but let me try it your way and see what happens.
Hmmm. I wrote the code to generate the set the same way you did and it works for me. The words come out for me in the order I showed above. No change. So I think that means your count_tweets function works differently than mine.
There is one other problem we can see in your exception trace, though: the way you retrieve the freq_pos and freq_neg values in the loop over all the words is incorrect. Not every word will have an entry for both sentiments, so you need to return 0 when the key does not exist. The cleanest way to do that is to use the “get()” method on the freqs dictionary, but you can also use if statements.
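Here is a small sketch of both options on a toy dictionary. The sentiment labels 1 and 0 are an assumption for illustration, and the variable names follow the ones mentioned above.

```python
# Toy freqs dictionary; note there is no ("lawnmow", 0) entry.
# The labels 1 (positive) and 0 (negative) are an assumption for this sketch.
freqs = {("smile", 1): 47, ("smile", 0): 9, ("lawnmow", 1): 1}

for word in ["smile", "lawnmow"]:
    # get() returns the second argument when the key is missing,
    # instead of raising KeyError the way freqs[key] would.
    freq_pos = freqs.get((word, 1), 0)
    freq_neg = freqs.get((word, 0), 0)
    print(word, freq_pos, freq_neg)

# The same lookup written with an if statement.
key = ("lawnmow", 0)
if key in freqs:
    freq_neg = freqs[key]
else:
    freq_neg = 0
```

For "lawnmow" both versions give a negative count of 0, whereas indexing with freqs[("lawnmow", 0)] would raise a KeyError.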
I added “lawnmow” to my instrumentation and here are my new results (still using your version of the “set” code):
['cash', 'breath', 'wan', 'balkan', 'sfvbeta', 'crowdfund', 'angle.nelson', 'brewproject', "else'", 'meee']
V: 9165, V_pos: 5804, V_neg: 5632, D: 8000, D_pos: 4000, D_neg: 4000, N_pos: 27547, N_neg: 27152
freq_pos for smile = 47
freq_neg for smile = 9
loglikelihood for smile = 1.5577981920239676
freq_pos for lawnmow = 1
freq_neg for lawnmow = 0
loglikelihood for lawnmow = 0.6823294546700677
0.0
9165
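As a sanity check, the loglikelihood values printed above can be reproduced from the other printed numbers, assuming the usual Laplace-smoothed Naive Bayes formula (the formula itself is not stated in this thread, so treat it as an assumption):

```python
import math

# Values copied from the output above.
V, N_pos, N_neg = 9165, 27547, 27152

def loglikelihood(freq_pos, freq_neg):
    """Log ratio of smoothed word probabilities (assumed formula)."""
    p_w_pos = (freq_pos + 1) / (N_pos + V)   # add-one smoothing
    p_w_neg = (freq_neg + 1) / (N_neg + V)
    return math.log(p_w_pos / p_w_neg)

print(loglikelihood(47, 9))   # smile: about 1.5578
print(loglikelihood(1, 0))    # lawnmow: about 0.6823
```

Note that the "lawnmow" value only works out because the missing negative entry is treated as a count of 0.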