I got 60/60 on the assignment, but I ran into an odd processing issue when running the same code on my machine, and I suspect it had to do with capitalization.
I brought over all the auxiliary files to run Week 2's assignment on my machine, and it raised this error:
Traceback (most recent call last):
  File "C1_W2_Assignment.py", line 176, in <module>
    w2_unittest.test_train_naive_bayes(train_naive_bayes, freqs, train_x, train_y)
  File "/Users/myuser/coursera/nlp/course1/w2_unittest.py", line 367, in test_train_naive_bayes
    if np.isclose(result2[key], value):
KeyError: 'े'
(The KeyError means the unit test looked up a key, the Devanagari vowel sign 'े', that my local loglikelihood dictionary didn't contain.) Just before that, one of the tests had failed with this output:
Wrong number of keys in loglikelihood dictionary.
Expected: 9165.
Got: 9160.
Debugging this test failure in train_naive_bayes, I made sure vocab was sorted so the online notebook and my machine could be compared item by item. In the final loop of the function, I printed out all ~9,165 words with a counter:
vocab_word_counter = 0
for word in vocab:
    vocab_word_counter += 1
    print(vocab_word_counter, word)
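To compare the two environments side by side, I also found it handy to dump the sorted vocab to disk; a minimal sketch, where the filename is just a placeholder:
import json

# Write the sorted vocab out so the Jupyter and local runs can be diffed later.
# ensure_ascii=False keeps emoticons and non-Latin tokens readable in the file.
with open("vocab_local.json", "w", encoding="utf-8") as f:
    json.dump(sorted(vocab), f, ensure_ascii=False)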
I've attached some of the output in a post-script below. Even on a quick inspection, certain items were processed differently on my machine and in the Jupyter notebook. The tokens :D, :P, and :d are all distinct on Jupyter, but on my machine only :d and :p exist. In fact, I can make the test fail on Jupyter as well if I lower() every word in the vocab.
Somehow, the words were being lowercased during processing on my machine. I searched the whole directory for a rogue lower() with no success.
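With the two vocab files from the sketch above, a set difference shows exactly which tokens diverge (vocab_jupyter and vocab_local are hypothetical names for the lists loaded back from those JSON dumps):
import json

with open("vocab_jupyter.json", encoding="utf-8") as f:
    vocab_jupyter = json.load(f)
with open("vocab_local.json", encoding="utf-8") as f:
    vocab_local = json.load(f)

# Tokens that exist in one environment but not the other.
print("jupyter only:", sorted(set(vocab_jupyter) - set(vocab_local)))  # e.g. ':D', ':P', '=D'
print("local only:  ", sorted(set(vocab_local) - set(vocab_jupyter)))  # e.g. ':d', '=d'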
I then took a step back and, before defining train_naive_bayes, ran print(len(freqs)), which yielded the expected 11436 online but 11429 on my machine. freqs is the output of count_tweets, which receives all tweets in train_x as input and runs them through process_tweet(). So one of the functions called there was apparently lowercasing at least certain inputs. The most obvious culprit was TweetTokenizer(), but everywhere in the code it runs as TweetTokenizer(preserve_case=False...).
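The subtlety, as far as I can tell from nltk's casual.py, is that preserve_case=False only lowercases tokens that do not match the tokenizer's emoticon regex, so a change to that regex (or to the exemption itself) between versions would explain why emoticons keep their case online but get lowercased locally. A minimal check you can run in both environments; the commented outputs are what I observed, yours may differ:
import nltk
from nltk.tokenize import TweetTokenizer

print(nltk.__version__)
tok = TweetTokenizer(preserve_case=False)
print(tok.tokenize("so happy today :D :P <3"))
# nltk 3.4.5 (coursera):   ['so', 'happy', 'today', ':D', ':P', '<3']
# nltk 3.6.5 (my machine): ['so', 'happy', 'today', ':d', ':p', '<3']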
I updated my macOS 12.1, conda 4.11 environment from python 3.8.8 to python 3.10 with nltk 3.6.5. The only changes were that freqs now had a length of 11,428 and the loglikelihood dictionary had 9,162 keys, still not the expected values.
I then thought of running nltk.__version__ on Coursera to find out which NLTK version the notebook uses. It is nltk=3.4.5, which requires python=3.8.12. So I downgraded to NLTK 3.4.5 with conda, which also downgraded Python.
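For anyone hitting the same thing, the downgrade was a single conda command (assuming a conda environment; conda resolves a compatible Python on its own):
conda install nltk=3.4.5
python -c "import nltk; print(nltk.__version__)"  # should print 3.4.5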
Now it works. 😮💨
P.S. Sample output:
jupyter:
431 9pm
432 :'(
433 :')
434 :'d
435 :(
436 :)
437 :-(
438 :-)
439 :-d
440 :/
441 ::
442 :D
443 :P
444 :\
445 :p
446 :|
447 ;(
448 ;)
449 ;-)
450 ;p
451 ;}
452 <---
453 <3
454 =:
455 =D
456 >:(
457 >:)
458 >:-(
459 >:d
460 @artofsleepingin
my machine:
431 9pm
432 :'(
433 :')
434 :'d
435 :(
436 :)
437 :-(
438 :-)
439 :-d
440 :/
441 ::
442 :\
443 :d
444 :p
445 :|
446 ;(
447 ;)
448 ;-)
449 ;p
450 ;}
451 <---
452 <3
453 =:
454 =d
455 >:(
456 >:)
457 >:-(
458 >:d
459 @artofsleepingin