60/60 on assignment but trouble running on own machine due to NLTK bug

I got 60/60 on the assignment, but I ran into an odd processing issue when running the same code on my machine. At first I suspected the issue had to do with capitalization.

I copied over all the auxiliary files to run Week 2’s assignment on my machine, and it produced this error:

Traceback (most recent call last):
  File "C1_W2_Assignment.py", line 176, in <module>
    w2_unittest.test_train_naive_bayes(train_naive_bayes, freqs, train_x, train_y)
  File "/Users/myuser/coursera/nlp/course1/w2_unittest.py", line 367, in test_train_naive_bayes
    if np.isclose(result2[key], value):
KeyError: 'े'

Just before that, one of the tests had failed with this output:

Wrong number of keys in loglikelihood dictionary. 
    Expected: 9165.
    Got: 9160.

While debugging this test failure in train_naive_bayes, I made sure vocab was sorted so I could compare the online notebook against my machine. In the function’s final loop, I printed all ~9,165 words with a counter:

    vocab_word_counter = 0
    for word in vocab:
        vocab_word_counter += 1
        print(vocab_word_counter, word)
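Rather than eyeballing two printed listings, a set difference pinpoints exactly which tokens diverge between environments. A minimal sketch, using tiny hypothetical vocabularies in place of the full ~9,165-word ones:

```python
# Hypothetical mini-vocabularies standing in for the two full listings.
online_vocab = {':D', ':P', ':d', ':p', '=D', '>:d', '9pm'}
local_vocab = {':d', ':p', '=d', '>:d', '9pm'}

# Tokens that exist online but not locally.
print(sorted(online_vocab - local_vocab))  # [':D', ':P', '=D']
```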

I’ve attached some of the output below in a postscript. On quick inspection, though, certain items seemed to be processed differently on my machine than in the Jupyter notebook.

The items :D, :P, and :d were all distinct on Jupyter, but only :d and :p exist on my machine. In fact, I can make the test fail on Jupyter too if I lower() every word in the vocab.
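The key-count mismatch is easy to reproduce in isolation; lowercasing any toy vocab (hypothetical, not the assignment’s) that distinguishes :D/:d and :P/:p merges keys the same way:

```python
# Lowercasing collapses case-distinct emoticons into single keys,
# which is exactly the "fewer keys than expected" symptom in the test.
vocab = {':D', ':P', ':d', ':p', 'happy'}
lowered = {w.lower() for w in vocab}
print(len(vocab), len(lowered))  # 5 3
```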

Somehow, in processing the words on my machine, they were being lowercased. I searched the whole directory for a rogue lower() with no success.

I then went a step back and, before defining train_naive_bayes, I checked the size of the frequency dictionary with

    len(freqs)

which yielded the expected 11436 online but 11429 on my machine.

freqs is output by count_tweets, which receives all the tweets in train_x as input and runs them through process_tweet(). It seemed like one of the functions called there was lowercasing at least certain inputs. The most obvious culprit was TweetTokenizer(), but the code everywhere runs as TweetTokenizer(preserve_case=False...).
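For reference, preserve_case=False is not supposed to touch emoticons: the tokenizer lowercases a token only if it does not match an emoticon pattern. A simplified sketch of that logic (the regex here is illustrative, far simpler than NLTK’s actual pattern):

```python
import re

# Illustrative emoticon pattern: optional angle bracket, "eyes",
# an optional "nose", and one or more "mouth" characters.
EMOTICON_RE = re.compile(r"^[<>]?[:;=][-o*']?[()\[\]dDpP/\\|@3]+$")

def lowercase_preserving_emoticons(tokens):
    """Lowercase tokens unless they look like emoticons."""
    return [t if EMOTICON_RE.match(t) else t.lower() for t in tokens]

print(lowercase_preserving_emoticons(["Hello", ":D", ":P", "WORLD"]))
# ['hello', ':D', ':P', 'world']
```

Under this contract, :D and :P should survive tokenization with their case intact, which is what the online notebook shows.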

I updated my macOS 12.1, conda 4.11 environment from python 3.8.8 to python 3.10 and nltk 3.6.5. The only changes were that freqs now had a length of 11,428 and the loglikelihood dictionary had 9,162 keys.

I then thought of running nltk.__version__ to find the Coursera NLTK version: it is nltk=3.4.5, which requires python=3.8.12. I downgraded to NLTK 3.4.5 with conda, which also downgraded Python.
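One small trap when comparing environments: version strings do not compare correctly as strings ("3.10" sorts before "3.8" lexicographically). A tiny helper (my own, not part of the assignment) compares them as integer tuples:

```python
def parse_version(v: str) -> tuple:
    """Turn a dotted version string like '3.4.5' into a comparable int tuple."""
    return tuple(int(part) for part in v.split("."))

# String comparison is misleading; tuple comparison is correct.
print("3.10" < "3.8")                              # True (lexicographic)
print(parse_version("3.10") > parse_version("3.8"))  # True (numeric)
print(parse_version("3.6.5") > parse_version("3.4.5"))  # True
```

In practice you would compare parse_version(nltk.__version__) against the grader’s pin (3.4.5) before running the notebook locally.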

Now it works. 😮‍💨

P.S. Sample output.

Jupyter:
431 9pm
432 :'(
433 :')
434 :'d
435 :(
436 :)
437 :-(
438 :-)
439 :-d
440 :/
441 ::
442 :D
443 :P
444 :\
445 :p
446 :|
447 ;(
448 ;)
449 ;-)
450 ;p
451 ;}
452 <---
453 <3
454 =:
455 =D
456 >:(
457 >:)
458 >:-(
459 >:d
460 @artofsleepingin

My machine:

431 9pm
432 :'(
433 :')
434 :'d
435 :(
436 :)
437 :-(
438 :-)
439 :-d
440 :/
441 ::
442 :\
443 :d
444 :p
445 :|
446 ;(
447 ;)
448 ;-)
449 ;p
450 ;}
451 <---
452 <3
453 =:
454 =d
455 >:(
456 >:)
457 >:-(
458 >:d
459 @artofsleepingin

Hi, amp.

Yes, package management is an issue (to put it mildly) in data science, even with conda. I would not call it an “NLTK bug,” though, as the title does.
It’s nice to see that you eventually managed to reproduce the results on your machine and shared your steps with others.



Maybe. But I don’t think NLTK would intend for a tokenizing function (much less its case-preserving feature) to behave differently between versions 3.4.5 and 3.6.5. I strongly suspect this is a bug. I’ll look into it more and probably submit an issue on GitHub.

For the purposes of the assignment, though, this might come up again with others who first test things on their own machines if their NLTK versions are different.

Well, I would argue that they did intend this particular change in behavior.
The change came in NLTK 3.6.5 (I suspect), when they started supporting emoji ZWJ sequences. (NLTK :: Release Notes)
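For context, an emoji ZWJ sequence joins several emoji code points with U+200D (the zero-width joiner) so that capable renderers display them as a single glyph; a tokenizer that supports them must treat the whole run as one token:

```python
# "Family: man, woman, girl" is three emoji code points
# joined by two zero-width joiners, rendered as one glyph.
ZWJ = "\u200D"
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467"
print(len(family))  # 5 code points, but a single on-screen glyph
```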

Raw text: ':D :P :d'

The prior behavior:

    In:  process_tweet(':D :P :d')
    Out: [':D', ':P', ':d']

The later behavior:

    In:  process_tweet(':D :P :d')
    Out: [':d', ':p', ':d']

One could argue whether :P and :p really are the same emoji, but I think this was intentional.

Anyway, we agree on the main point: others who implement the assignment on their own machines should be wary of these kinds of caveats. :+1:


You’re right! Thank you for looking into this! So not a capitalization issue, but rather an emoji-processing one.

FWIW, some other assignments in the specialization seem to run on different NLTK versions (so far, 3.4.5 and 3.5), which produce slightly different pre-processing results.

@amp You are right. It is the C2_W4_Assignment. I have described this issue in this post.

There should be some way out of this. There are also some incompatibilities between Python versions, but the issues with nltk versions are more disruptive.

@arvyzukai & @amp I notice that the stemmer in the process_tweet in utils.py is lowercasing :D to :d in nltk 3.6.5.


Yes @Shantimohan_Elchuri , nltk 3.6.5 started treating :D and :d as the same emoji (and also others like :P and :p). This is the reason the dictionary tests fail: there are fewer “words” than expected. If you want to reproduce the results without failing (to have the same number of “words”) on your own machine, you should downgrade your nltk version to 3.4.5.
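If downgrading is impractical, one workaround (a sketch of my own, not the course’s code) is to skip stemming for tokens that do not start with a letter, so a stemmer that lowercases its input (as reported for newer NLTK) never touches emoticons like :D:

```python
from string import ascii_letters

def stem_if_word(stemmer, token):
    """Stem only letter-initial tokens; pass emoticons like ':D' through,
    so a lowercasing stemmer cannot alter them."""
    if token and token[0] in ascii_letters:
        return stemmer.stem(token)
    return token
```

This would be a drop-in replacement at the point where process_tweet calls stemmer.stem(word), though it changes local behavior relative to the grader and should only be used for experimentation.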