NLP - C1W2 - Use of set() in finding vocab

Hi,

So this is more of a Python question really. In Ex 2 we are asked to find the size of our vocab using set() and our dictionary freqs.

To be honest, I’ve never used set() before, so even after looking at a few docs I’m not quite sure how it works or how to use it.

I already know how to calculate V; I have just removed all my answers here.

Obviously, just set(freqs) does not produce the desired result.

Hi @Nevermnd

The set() type supports mathematical set operations such as union, intersection, difference, and symmetric difference. Kindly see the attached image, which explains this with an example.

Regards
DP

Deepti has given the documentation, but the key point in this application here is that you want the unique words. The point is that some words have both a positive and negative entry in the freqs dictionary, so if you just take all the words from the keys of the dictionary, there will be duplicates. That’s what we are using “set()” for here. You could implement that with a loop, but it would be more complicated code and would be (I assume) quite a bit slower. Mind you, I have no idea how the “set()” function is actually implemented, so maybe I’m wrong about the efficiency part of the argument. Hmmmm, we could test it out. But using set() is clearly very simple for us to write. Less code is better, right?

Suppose I gave you a python list of strings each of which is a word. How would you write the code to make sure that the list does not contain any duplicate words?
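
As a rough sketch of the two approaches discussed above (set() versus an explicit loop), the comparison could look like the following. The function names and the word lists are made up for illustration and are not from the course notebook; the timings are only indicative and will vary by machine.

```python
import timeit


def unique_with_set(words):
    """Return the distinct words using the built-in set type."""
    return set(words)


def unique_with_loop(words):
    """Return the distinct words using an explicit loop and a list."""
    unique = []
    for word in words:
        if word not in unique:  # scans the whole list for every word
            unique.append(word)
    return unique


# Hypothetical word list with repeats (not the course data).
words = ["happy", "sad", "happy", "smile", "sad", "happy"]

# Both approaches find the same distinct words.
assert unique_with_set(words) == set(unique_with_loop(words))
print(len(unique_with_set(words)))  # 3

# Rough timing on a larger made-up list with 200 distinct words.
big = [f"w{i % 200}" for i in range(2000)]
print("set :", timeit.timeit(lambda: unique_with_set(big), number=5))
print("loop:", timeit.timeit(lambda: unique_with_loop(big), number=5))
```

The loop version keeps the first-seen order of the words, while a set has no guaranteed order, so the two are only interchangeable when order does not matter (as when counting the vocabulary size).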

Right, think about what the entries in freqs are: they are pairs of (key, value) in which the key is a tuple of (word, sentiment) and value is the count for that key.

What you want here is the unique words from the keys, right?
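
To make the structure concrete, here is a tiny made-up freqs dictionary (not the course data). It assumes the sentiment is encoded as 1 for positive and 0 for negative, which is an assumption for this sketch. It shows why the keys themselves are already unique while the words inside them are not.

```python
# Hypothetical miniature freqs dictionary: (word, sentiment) -> count.
# Assumption: sentiment is 1 for positive and 0 for negative.
freqs = {
    ("happy", 1): 12,
    ("happy", 0): 2,
    ("sad", 0): 9,
    ("smile", 1): 47,
    ("smile", 0): 9,
}

# Iterating over a dict yields its keys, so set(freqs) is a set of tuples.
keys_as_set = set(freqs)
print(len(keys_as_set))  # 5: every (word, sentiment) pair is already unique

# The word on its own repeats across keys with different sentiments.
words_only = []
for word, sentiment in freqs:
    words_only.append(word)
print(words_only)  # ['happy', 'happy', 'sad', 'smile', 'smile']
```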

The whole basis of getting a set of unique words in the vocabulary fits in with what Paul has added to the explanation. Basically, set() creates a new set from the elements passed during the call. It takes an iterable as an argument and returns a new set object.
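
A minimal sketch of that behaviour, using made-up values rather than anything from the assignment, might look like this. It shows set() building a new set from different iterables, followed by the four operations mentioned earlier in the thread.

```python
# set() accepts any iterable and returns a new set of its distinct elements.
letters = set("hello")        # built from a string
numbers = set([1, 2, 2, 3])   # built from a list
print(sorted(letters))  # ['e', 'h', 'l', 'o']
print(sorted(numbers))  # [1, 2, 3]

# Made-up word sets to illustrate the set operations.
pos_words = {"happy", "smile", "great"}
neg_words = {"sad", "smile", "awful"}

print(sorted(pos_words | neg_words))  # union: words in either set
print(sorted(pos_words & neg_words))  # intersection: ['smile']
print(sorted(pos_words - neg_words))  # difference: ['great', 'happy']
print(sorted(pos_words ^ neg_words))  # symmetric difference: words in exactly one set
```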

Regards
DP

Okay, I think I see it now. Sometimes I still have a little trouble working out iterators in Python. I know in some sense they are much easier and more flexible than a brute-force indexing approach, but thinking in terms of indexing just comes more naturally to me.

But, yes you are right. I mean I was using len(set(…)), but then the keys are tuples and we only want the set of words, regardless of the associated sentiment. I figured it out.

So, how would you extract “word” from tuple (word, sentiment)?
Thx

You can go out the same way you came in:

>>> my_tuple = ("happy", "positive")
>>> word, _ = my_tuple
>>> print(word)
happy

There are lots of ways in python. Anthony has shown you one. A tuple can also be indexed:

word = key[0]

which selects the first element of the tuple. In this instance your best bet is to take that idea and create a Python comprehension over the keys from that.
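
One possible reading of that suggestion is a set comprehension over the dictionary keys. The sketch below uses a made-up freqs dictionary and hypothetical variable names, so treat it as an illustration of the idea rather than the assignment's expected code.

```python
# Hypothetical freqs dictionary; keys are (word, sentiment) tuples.
freqs = {("happy", 1): 12, ("happy", 0): 2, ("sad", 0): 9}

# Tuple indexing: key[0] is the word, key[1] is the sentiment.
vocab_by_index = {key[0] for key in freqs}

# Equivalent form using tuple unpacking in the loop target.
vocab_by_unpacking = {word for word, _ in freqs}

assert vocab_by_index == vocab_by_unpacking
print(sorted(vocab_by_index))  # ['happy', 'sad']
print(len(vocab_by_index))     # 2
```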

Thanks!

I did what you and Nevermnd suggested. Now I get this error and I couldn’t figure it out after 4 hours of debugging. Please give me another hint. Thx

{moderator edit - solution code removed}

I added the print statement like yours to show the first 10 words in the vocabulary and here’s what I see:

['cash', 'breath', 'wan', 'balkan', 'sfvbeta', 'crowdfund', 'angle.nelson', 'brewproject', "else'", 'meee']
V: 9165, V_pos: 5804, V_neg: 5632, D: 8000, D_pos: 4000, D_neg: 4000, N_pos: 27547, N_neg: 27152
freq_pos for smile = 47
freq_neg for smile = 9
loglikelihood for smile = 1.5577981920239676
0.0
9165

Note that I have some other instrumentation in addition to just that.

So my freqs dictionary has the words in a different order. I’m not sure offhand whether that is in itself a problem. The first question is whether your count_tweets function passed the tests and whether you ran everything in order.

The code you show is a little different than mine and you’re working a little too hard, but let me try it your way and see what happens.

Hmmm. I wrote the code to generate the set the same way you did and it works for me. The words come out for me in the order I showed above. No change. So I think that means your count_tweets function works differently than mine.

There is one other problem we can see in your exception trace, though: the way you retrieve the freq_pos and freq_neg values in the loop over all the words is incorrect. Not every word will have an entry for both sentiments, so you need to return 0 when the key does not exist. The cleanest way to do that is to use the “get()” method on the freqs dictionary, but you can also use if statements.
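
For illustration, a missing (word, sentiment) key can be handled with dict.get() and a default of 0, roughly as below. The helper name count_for and the small dictionary are hypothetical; the counts mirror the ones printed in this thread.

```python
# Hypothetical freqs dictionary; note there is no ("lawnmow", 0) entry.
freqs = {("smile", 1): 47, ("smile", 0): 9, ("lawnmow", 1): 1}


def count_for(freqs, word, sentiment):
    """Return the count for (word, sentiment), or 0 if the pair never occurred."""
    return freqs.get((word, sentiment), 0)


print(count_for(freqs, "lawnmow", 1))  # 1
print(count_for(freqs, "lawnmow", 0))  # 0, whereas freqs[("lawnmow", 0)] raises KeyError
```

The equivalent with an if statement would check `(word, sentiment) in freqs` before indexing; get() simply folds that check and the default into one call.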

I added “lawnmow” to my instrumentation and here are my new results (still using your version of the “set” code):

['cash', 'breath', 'wan', 'balkan', 'sfvbeta', 'crowdfund', 'angle.nelson', 'brewproject', "else'", 'meee']
V: 9165, V_pos: 5804, V_neg: 5632, D: 8000, D_pos: 4000, D_neg: 4000, N_pos: 27547, N_neg: 27152
freq_pos for smile = 47
freq_neg for smile = 9
loglikelihood for smile = 1.5577981920239676
freq_pos for lawnmow = 1
freq_neg for lawnmow = 0
loglikelihood for lawnmow = 0.6823294546700677
0.0
9165
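
As a sanity check on the numbers above, the sketch below recomputes the two log-likelihood values. It assumes add-one (Laplace) smoothing, which is not stated in this thread but does reproduce the printed values; the function name is made up for this example.

```python
import math


def smoothed_loglikelihood(freq_pos, freq_neg, n_pos, n_neg, v):
    """Log ratio of smoothed word probabilities (assumes add-one smoothing)."""
    p_w_pos = (freq_pos + 1) / (n_pos + v)
    p_w_neg = (freq_neg + 1) / (n_neg + v)
    return math.log(p_w_pos / p_w_neg)


# N_pos, N_neg and V as printed in the output above.
N_pos, N_neg, V = 27547, 27152, 9165

print(smoothed_loglikelihood(47, 9, N_pos, N_neg, V))  # smile, approx 1.5578
print(smoothed_loglikelihood(1, 0, N_pos, N_neg, V))   # lawnmow, approx 0.6823
```

Without the 0 default for the missing negative count of "lawnmow", the lookup would fail before this calculation could run.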

Yup. I found this oversight just as I saw your comment. Thanks for the great help :+1: