NLP - C1W2 - Use of set() in finding vocab

Nevermnd · June 27, 2024, 4:09pm

Hi,

So this is more of a Python question really. In Ex 2 we are asked to find the size of our vocab via the use of set() and our dictionary freqs.

To be honest, I’ve never used the set() command before so even after looking at a few docs I’m not quite sure how this works/how to use it.

I already know how to calculate V, I just have removed all my answers here.

*Obviously just set(freqs) does not produce the desired result.

Deepti_Prasad · June 27, 2024, 4:31pm

Hi @Nevermnd

The set() function creates a set, which supports mathematical operations such as union, intersection, difference, and symmetric difference. Kindly see the image below, which explains this with an example.
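As a rough illustration of those operations, here is a small sketch. The word lists are made up for the example and are not taken from the assignment.

```python
# Two small example sets; the words are invented for illustration.
a = {"happy", "sad", "smile"}
b = {"smile", "cry", "sad"}

union = a | b              # elements in either set
intersection = a & b       # elements in both sets
difference = a - b         # elements in a but not in b
sym_difference = a ^ b     # elements in exactly one of the two sets

# set() accepts any iterable and drops repeated elements.
letters = set("hello")

print(sorted(union))           # ['cry', 'happy', 'sad', 'smile']
print(sorted(intersection))    # ['sad', 'smile']
print(sorted(difference))      # ['happy']
print(sorted(sym_difference))  # ['cry', 'happy']
print(sorted(letters))         # ['e', 'h', 'l', 'o']
```

The results are sorted before printing only because a set has no guaranteed order.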

Regards
DP

paulinpaloalto · June 27, 2024, 4:37pm

Deepti has given the documentation, but the key point in this application here is that you want the unique words. The point is that some words have both a positive and negative entry in the freqs dictionary, so if you just take all the words from the keys of the dictionary, there will be duplicates. That’s what we are using “set()” for here. You could implement that with a loop, but it would be more complicated code and would be (I assume) quite a bit slower. Mind you, I have no idea how the “set()” function is actually implemented, so maybe I’m wrong about the efficiency part of the argument. Hmmmm, we could test it out. But using set() is clearly very simple for us to write. Less code is better, right?

Suppose I gave you a python list of strings each of which is a word. How would you write the code to make sure that the list does not contain any duplicate words?
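For comparison, here is a minimal sketch of both approaches on a made-up list of words. It is only meant to show that the loop and set() end up with the same unique words, not to settle the efficiency question.

```python
words = ["happy", "sad", "happy", "smile", "sad"]  # invented example data

# Loop version: keep a word only the first time it appears.
unique_loop = []
for w in words:
    if w not in unique_loop:
        unique_loop.append(w)

# set() version: one call, but the order of the result is not guaranteed.
unique_set = set(words)

print(unique_loop)                     # ['happy', 'sad', 'smile']
print(len(unique_set))                 # 3
print(set(unique_loop) == unique_set)  # True
```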

paulinpaloalto · June 27, 2024, 4:46pm

Right, think about what the entries in freqs are: they are pairs of (key, value) in which the key is a tuple of (word, sentiment) and value is the count for that key.
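To make that structure concrete, here is a toy stand-in for freqs. The words and counts are invented, and encoding the sentiment as 1 or 0 is an assumption made for the sketch.

```python
# Toy stand-in for freqs; words and counts are invented for illustration.
# Sentiment is written as 1 (positive) or 0 (negative), which is an assumption.
freqs = {
    ("happy", 1): 3,
    ("happy", 0): 1,
    ("sad", 0): 2,
}

# Each entry pairs a (word, sentiment) tuple with its count.
for key, count in freqs.items():
    print(key, count)

# Iterating over a dict yields its keys, so set(freqs) is a set of tuples,
# not a set of words.
keys_as_set = set(freqs)
print(len(keys_as_set))  # 3, although only two distinct words appear
```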

What you want here is the unique words from the keys, right?

Deepti_Prasad · June 27, 2024, 4:54pm

The whole basis for getting a set of unique words in the vocabulary fits in with what Paul has added to the explanation. Basically, set() creates a new set from the elements passed in the call. It takes an iterable as an argument and returns a new set object.

Regards
DP

Nevermnd · June 27, 2024, 6:23pm

Okay, I think I see it now. Sometimes I still have a little trouble working out iterators in Python-- I know in some sense they are much easier/more flexible than some brute force indexing approach, but at the same time thinking in that way just comes more naturally to me I think.

But, yes you are right. I mean I was using len(set(…)), but then the keys are tuples and we only want the set of words, regardless of the associated sentiment. I figured it out.

Alex_Tu · July 2, 2024, 5:31am

So, how would you extract “word” from tuple (word, sentiment)?
Thx

Nevermnd · July 2, 2024, 12:10pm

You can go out the same way you came in:

>>> my_tuple = ("happy", "positive")
>>> word, _ = my_tuple
>>> print(word)
happy

paulinpaloalto · July 2, 2024, 2:28pm

There are lots of ways in python. Anthony has shown you one. A tuple can also be indexed:

word = key[0]

which selects the first element of the tuple. In this instance your best bet is to take that idea and create a python “comprehension” from that.
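One way to apply the indexing idea to every key at once is a set comprehension. This is only a sketch on a toy dictionary (invented data, with sentiment written as 1 or 0 by assumption), not the assignment solution.

```python
# Toy stand-in for freqs (invented data; sentiment as 1/0 is an assumption).
freqs = {("happy", 1): 3, ("happy", 0): 1, ("sad", 0): 2}

# Indexing: key[0] is the word, key[1] is the sentiment.
words_by_index = {key[0] for key in freqs}

# Unpacking: the same result, naming the parts instead of indexing.
words_by_unpacking = {word for word, _ in freqs}

print(sorted(words_by_index))                # ['happy', 'sad']
print(words_by_index == words_by_unpacking)  # True
print(len(words_by_index))                   # 2
```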

Alex_Tu · July 2, 2024, 8:54pm

Thanks!

Alex_Tu · July 2, 2024, 9:01pm

I did what you and Nevermnd suggested. Now I get this error and I couldn’t figure it out after 4 hours of debugging. Please give me another hint. Thx

{moderator edit - solution code removed}

paulinpaloalto · July 2, 2024, 10:35pm

I added the print statement like yours to show the first 10 words in the vocabulary and here’s what I see:

['cash', 'breath', 'wan', 'balkan', 'sfvbeta', 'crowdfund', 'angle.nelson', 'brewproject', "else'", 'meee']
V: 9165, V_pos: 5804, V_neg: 5632, D: 8000, D_pos: 4000, D_neg: 4000, N_pos: 27547, N_neg: 27152
freq_pos for smile = 47
freq_neg for smile = 9
loglikelihood for smile = 1.5577981920239676
0.0
9165

Note that I have some other instrumentation in addition to just that.

So my freqs dictionary has the words in a different order. I’m not sure off hand whether that is in itself a problem. The first question is whether your count_tweets function passed the tests and whether you ran everything in order.

The code you show is a little different than mine and you’re working a little too hard, but let me try it your way and see what happens.

paulinpaloalto · July 2, 2024, 10:41pm

Hmmm. I wrote the code to generate the set the same way you did and it works for me. The words come out for me in the order I showed above. No change. So I think that means your count_tweets function works differently than mine.

There is one other problem we can see in your exception trace, though: the way you retrieve the freq_pos and freq_neg values in the loop over all the words is incorrect. Not every word will have an entry for both sentiments, so you need to return 0 when the key does not exist. The cleanest way to do that is to use the “get()” method on the freqs dictionary, but you can also use if statements.
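Here is a small sketch of both options on a toy dictionary. The counts are copied from the printed output in this thread, and writing the sentiment as 1 or 0 is an assumption.

```python
# Toy stand-in for freqs; counts copied from the output shown in this thread.
# Sentiment as 1 (positive) / 0 (negative) is an assumption.
freqs = {("smile", 1): 47, ("smile", 0): 9, ("lawnmow", 1): 1}

word = "lawnmow"

# freqs[(word, 0)] would raise KeyError because that pair was never counted.
# get() returns the supplied default instead.
freq_pos = freqs.get((word, 1), 0)
freq_neg = freqs.get((word, 0), 0)
print(freq_pos, freq_neg)  # 1 0

# Equivalent version with an if statement.
key = (word, 0)
if key in freqs:
    freq_neg_if = freqs[key]
else:
    freq_neg_if = 0
print(freq_neg_if)  # 0
```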

I added “lawnmow” to my instrumentation and here are my new results (still using your version of the “set” code):

['cash', 'breath', 'wan', 'balkan', 'sfvbeta', 'crowdfund', 'angle.nelson', 'brewproject', "else'", 'meee']
V: 9165, V_pos: 5804, V_neg: 5632, D: 8000, D_pos: 4000, D_neg: 4000, N_pos: 27547, N_neg: 27152
freq_pos for smile = 47
freq_neg for smile = 9
loglikelihood for smile = 1.5577981920239676
freq_pos for lawnmow = 1
freq_neg for lawnmow = 0
loglikelihood for lawnmow = 0.6823294546700677
0.0
9165
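As a sanity check, the printed loglikelihood values are consistent with a Laplace-smoothed log ratio. That formula is an assumption on my part about what the function computes, and the helper name below is hypothetical. The inputs are the V, N_pos, N_neg and frequency values printed above.

```python
import math

# Values copied from the printed output above.
V, N_pos, N_neg = 9165, 27547, 27152


def smoothed_loglikelihood(freq_pos, freq_neg):
    """Hypothetical helper: log ratio with add-one (Laplace) smoothing."""
    p_w_pos = (freq_pos + 1) / (N_pos + V)
    p_w_neg = (freq_neg + 1) / (N_neg + V)
    return math.log(p_w_pos / p_w_neg)


smile = smoothed_loglikelihood(47, 9)
lawnmow = smoothed_loglikelihood(1, 0)

# Compare against the values printed in the post.
print(math.isclose(smile, 1.5577981920239676, abs_tol=1e-6))    # True
print(math.isclose(lawnmow, 0.6823294546700677, abs_tol=1e-6))  # True
```

Note that lawnmow only gets a finite value because its missing negative count is treated as 0 and then smoothed.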

Alex_Tu · July 2, 2024, 11:10pm

Yup. I found this oversight just as I saw your comment. Thanks for the great help.
