# Challenged with Unique Word Calculation for Vocabulary

Mentors and Students,

I am having trouble calculating the number of unique words in my dictionary “V”. I am only three away from the expected number, so I believe that my approach is correct, but somehow a couple of words are getting shaved off.

My approach is:
Use the first element of each tuple key of `freqs` to supply the word, with a list comprehension:

```python
wordlist = [pair[0] for pair in freqs.keys()]
```

Then take a set of that list to reduce it to unique words:

```python
vocab = set(wordlist)
```
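For reference, those two lines behave as expected on a self-contained toy example (the `freqs` contents here are made up for illustration):

```python
# Toy freqs dictionary keyed by (word, label) tuples, as in the assignment.
freqs = {
    ("happi", 1): 3,
    ("happi", 0): 1,  # same word under both labels -> still one vocab entry
    ("sad", 0): 2,
    ("tire", 0): 2,
}

wordlist = [pair[0] for pair in freqs.keys()]  # one entry per (word, label) key
vocab = set(wordlist)                          # unique words only

print(len(wordlist))  # 4
print(len(vocab))     # 3
```

`len(wordlist)` exceeds `len(vocab)` whenever a word appears under both labels, so those two numbers are not expected to match.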

I have no idea why that is not working. There does not seem to be any place where it could go wrong.

Any help would be appreciated.

John

That’s exactly what I did. Have you considered that maybe your freqs dictionary is wrong? That is the input to your calculation here. I.e. maybe the real bug is in count_tweets.

I added some instrumentation to my code to print out the various numbers:

```
V = 9165, len(wordlist) 11436
V: 9165, V_pos: 5804, V_neg: 5632, D: 8000, D_pos: 4000, D_neg: 4000, N_pos: 27547, N_neg: 27152
freq_pos for smile = 47
freq_neg for smile = 9
loglikelihood for smile = 1.5577981920239676
0.0
9165
```
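As a sanity check, the loglikelihood line in that printout can be reproduced directly from the other counts, assuming the Laplace-smoothed Naive Bayes formula the assignment uses:

```python
import math

# Reproduce 'loglikelihood for smile' from the counts above, with add-1 smoothing:
#   lambda(w) = log( ((freq_pos + 1) / (N_pos + V)) / ((freq_neg + 1) / (N_neg + V)) )
V, N_pos, N_neg = 9165, 27547, 27152
freq_pos, freq_neg = 47, 9

p_w_pos = (freq_pos + 1) / (N_pos + V)  # smoothed P(word | pos)
p_w_neg = (freq_neg + 1) / (N_neg + V)  # smoothed P(word | neg)
loglikelihood = math.log(p_w_pos / p_w_neg)

print(loglikelihood)  # ~1.5578
```

This lands on the 1.5577981920239676 shown above, which suggests the downstream math is consistent with the counts.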

What do you see for the length of the wordlist (the number of non-unique words)?

Paul,

Just finished dinner here on the east coast. Thanks for the help.

I will follow up on those recommendations and get back after troubleshooting. Yeah, it seemed foolproof. I calculated the keys in the word loop (`word in process_tweet(tweet)`) using `pair = (word, y)`, so `word` should be correct because it came straight from the `process_tweet(tweet)` utility function.

I have a pretty linear set of calculations from tweet to unique word. The numbers and the “smile” values you gave me give me a couple of ideas. This is not quantum physics or anything, so I can follow up. I do not have a whole lot of Python time (six months of tinkering), so I am trying to keep things as linear as possible. I finished my last lab in the Stanford ML class last week, so the MATLAB/Python syntax and keyword confusion should fade and I will have more time for this class.

Thanks again.
John

Hi, John.

That all sounds completely reasonable, but I hope something will occur to you based on the diagnostics I showed. I feel your pain on the MATLAB to python conversion, although it’s now been quite a few years for me since I made the switch from Stanford ML to DLS.

Let me know how the debugging goes.

Cheers,
Paul

Not sure what the rules are for showing code, but the count_tweets loop is:

```python
for y, tweet in zip(ys, tweets):
    for word in process_tweet(tweet):
        # define the key, which is the word and label tuple
        pair = (word, y)
```
The loop iterates through the words in each tweet and pairs the tweet's label `y` with the word in the `pair` tuple that becomes the key. Here is the correct output from the tests:

```
{('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}
Expected Output : {('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}
```

```
w2_unittest.test_count_tweets(count_tweets)
All tests passed
```
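Just for illustration, here is a self-contained sketch of that style of loop, end to end, with a stand-in tokenizer (the real `process_tweet` does normalization such as stemming and stopword removal; `process_tweet_stub` and the sample tweets below are made up):

```python
# Stand-in for the course's process_tweet utility: just lowercases and splits.
def process_tweet_stub(tweet):
    return tweet.lower().split()

def count_tweets_sketch(tweets, ys, process=process_tweet_stub):
    """Build a {(word, label): count} frequency dictionary."""
    freqs = {}
    for y, tweet in zip(ys, tweets):
        for word in process(tweet):
            pair = (word, y)                      # (word, label) tuple key
            freqs[pair] = freqs.get(pair, 0) + 1  # count occurrences per key
    return freqs

tweets = ["happi", "trick", "sad", "tire tire"]
ys = [1, 0, 0, 0]
print(count_tweets_sketch(tweets, ys))
# {('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}
```

That reproduces the expected test output above, so the counting idiom itself is sound.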

This is my troubleshooting calculation right after calling freqs.

You can see that my total is 9162 instead of 9165. I will trace things back into the utility function from there.

What is the length of the word list before you do the “set” operation?

Note that I showed the correct value above.

Also note that the test for `count_tweets` there is pretty limited. There is a multitude of sins that could be hidden there.
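If you can get your hands on a known-good `freqs` (or even just its vocabulary) to compare against, a set difference will pinpoint exactly which words got shaved off. Everything in this sketch is illustrative:

```python
# Toy stand-ins for a known-good freqs and a locally computed one.
def vocab_of(freqs):
    return {pair[0] for pair in freqs}  # unique words across (word, label) keys

good_freqs = {("happi", 1): 1, ("sad", 0): 1, ("tire", 0): 2}
my_freqs = {("happi", 1): 1, ("tire", 0): 2}  # 'sad' got lost somewhere

missing = vocab_of(good_freqs) - vocab_of(my_freqs)
print(missing)  # {'sad'}
```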


Following up on that.

11428, which is not the same as yours. Here is the output right after the call to freqs.

I went back into count_tweets and did not see any problems but I will dig more.

Paul,

I think my data is incorrect. Some of this work looked a lot like what I did last week, so I checked the numbers that were calculated then. Last week I did not get the same result in one of the calculations, but I still got full credit. Perhaps something happened when I downloaded the files. This is from last week; I thought I did the same calculation then.

The expected output was the same as yours, and mine was the same as last week: 11428.

len(freqs) from last week is the same as len(wordlist)

Knocking off for the evening. 6.5 hours on this lab today… enough is enough. Tomorrow is another day.

There are a couple of concerning comments there:

If you are running this stuff locally, then all bets are off. If you run it on the course website, there is literally no connection between the Week 1 and Week 2 assignments. The two trees of files are separate, right?


I will run it on the course website from now on. For the past six months or so, I have set up a fully functional system. I will be doing systems engineering work on my own system as soon as I can start to become somewhat proficient with text processing. I have access to a large number of technical specifications and I want to process them as part of an MBSE workflow. That is why I am learning NLP. I need to be fully functional here but if that messes up the course, I will use the online system. I did the entire ML course using a local system, just downloading the starter files and then submitting the labs from my machine. That was using Octave though. OK. Tomorrow, I will copy my work into the Jupyter notebook online. Thanks.

Yes, I have separated each week’s work because I take notes on the lecture slides and keep data and ancillary references separate. I will use the online site from now on, though; I am trying to build up my own toolkit as time goes on. Online learning is new to me. I finished graduate school in 1992 and worked for 30 years as an aerospace engineer.

This week’s lectures were way too thin for my taste. I found what looks to be a good textbook on this subject so I read the first three chapters of it today. https://web.stanford.edu/~jurafsky/slp3/ed3book_jan122022.pdf It goes over much of the first two weeks of this course in depth.

Paul,

Thanks. I am not sure that these DLS courses are going to scratch the itch I have to learn this material. It seems that the lectures are “teaching to the test/assignment” and that hard work on the assignments is going to be where I really learn. I know that debugging code is part of the work but I am really studying to learn NLP, not to be good at Python. That will come with time.

Does the depth of the instruction increase with courses 2, 3 and 4?

Honestly, I am not all that enthused about the lectures and lecture slides as references and I am going to have to consult external sources to get skills in Python, Math and ML/NLP that I feel I will need to be effective solving “real world problems.” That said… you Mentors are worth the money and time spent in the program. I got tons of help in the DL class and you have definitely helped me get back on track with this problem. I am trying to build a “quiver” of problem/solution applications and I am not sure I will get there applying 15+ hours a week to this specialization.

I am open to your advice and recommendation and will probably follow it. What should I do?

Very Best!
John

Used the online system and everything is working. For some reason, when I downloaded “all files,” something did not stay consistent: everything worked except that the unit tests failed in two places. When I have time, I will try to figure out why. As soon as I started running things online, everything worked fine and the tests passed. I am down to the last two sections of the lab now, and after that initial chunk of code everything is plug and chug… just put in the equations. I am learning from this work, just not as in-depth as I would like. Thanks. Your consult was priceless.

It’s great to hear that going online fixed the weird issues with train_naive_bayes. As to what the rest of NLP is like, I literally have not gotten beyond Week 2 of Course 1 of NLP. I think the real “meat” of NLP is the Sequence Model and Attention Model material that’s in Course 4 here and in Course 5 of DLS. That’s where the real action is. But since I haven’t gotten to that part of NLP or beyond Week 1 of DLS Course 5, I’m not really qualified to advise on that. But I think the example you found of the material from Stanford is a good way to go. They have been extremely generous about just publishing a lot of material from their graduate level CS courses about DL related topics. If you find the material here too shallow, that’s a good place to look. E.g. here’s the website for CS224n Natural Language Processing.

Also it’s good to hear that the mentor support has been helpful, but one other point to make there is that the mentors are volunteers. We do not get paid to do this. It’s just our way of “giving back” to the community. There’s nothing to prevent you or other students from contributing in a similar way. Even if you don’t have a “mentor” badge by your name, a useful answer will be recognized and appreciated.


Paul,

Thanks very much for the support. I did not know you are not paid. Double thanks. I think that I will finish out this course since I have paid for it. I looked at the CS224n class and watched a number of Professor Manning’s lectures… that guy has a great sense of humor. All the assignments and most of the references are available online. I do not want to pay the ~\$1,400 price or try to get accepted into Stanford at my age and experience level. I would pay one of the TAs to grade my work if it were legal. I really just want feedback and community. I am retired, but I realized that I can work with my former colleagues to process old-school system and subsystem specifications with NLP to classify, summarize, and even parse the documents into a format that would make systems modeling much more cost-effective. As someone who has built systems models from poorly written specifications for a dozen years or so, I want to try inserting ML/NLP into the workflow. These courses are a means to that end, and I might do just as well stealth-auditing CS224n.

Cheers!
John

I have no visibility into what goes on with the business model between Coursera (the platform provider) and the course content providers (e.g. DLAI), but apparently the business model does not allow for any kind of paid support for the students. It’s a good point that I had forgotten that Stanford does actually allow people to take their courses online, but the price is (ahem) at a different level than Coursera. Some of that must reflect Stanford just trying to extract \$\$ for value, but it does give you at least some sense of the true cost of having actual TAs whose job it is to help you, grade your work and so forth. Maybe there is some truth to the old adage “you get what you pay for”, both in terms of the course content and the quality of the support.

I’m also retired and spent my career doing software engineering and engineering management (mostly in operating systems internals working for computer systems companies), but my academic background is in math. What got me interested in ML/DL was when my son signed up for Prof Ng’s original Stanford Machine Learning course and asked if I’d be interested to take it along with him. I was completely hooked: it seemed like the perfect intersection of math and CS. So far, I haven’t made it beyond taking classes and helping fellow students, but I salute your project of actually applying NLP techniques to a real world problem with which you have significant domain knowledge. Maybe if you audit CS224 you can either find or create an online community of fellow students. I’d have to believe there are other folks in the same position: wanting to take CS224, but not wanting to pay the big bucks. A little googling might turn up something.
