Issue in running C2_W4_Assignment on a personal PC

When I run this assignment on a personal PC, cell #23 produces errors, but the notebook doesn't stop and runs to the end. In cell #23, w4_unittest.test_gradient_descent passes only 2 tests and fails 14.

When I traced back, I found that at the output of cell #3 the Number of tokens is 60976 on my PC, whereas it is 60996 on the Coursera server. This results in a vocabulary size of 5775 on my PC versus 5778 on the Coursera server. I got the same results on my Mac as well.
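For context, the token and vocabulary counts in cell #3 come from a step like the following (a minimal stdlib-only stand-in; the actual notebook uses nltk.word_tokenize, whose output is exactly what differs between nltk versions):

```python
import re

def count_tokens(data):
    # Crude regex word split as a stand-in for nltk.word_tokenize;
    # keep lowercase purely-alphabetic tokens, as the notebook does.
    tokens = [t for t in re.findall(r"[a-z']+", data.lower()) if t.isalpha()]
    return len(tokens), len(set(tokens))

n_tokens, vocab_size = count_tokens("O for a muse of fire, a muse that would ascend")
print(n_tokens, vocab_size)  # 11 9
```

Because the token boundaries come from the tokenizer, any change in tokenizer behavior between nltk versions shifts both counts.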

Meanwhile, I downgraded nltk from version 3.6.5 to 3.4.5 to troubleshoot a similar error I was getting for C1_W2_Assignment, as suggested in this forum post. With that version, C2_W4_Assignment gives a Number of tokens of 60975 and a vocabulary size of 5777.

When I upgraded nltk from 3.4.5 to version 3.5 (the version on the Coursera servers for this assignment), the unit tests passed, as I got the same numbers as on the Coursera server.

So the first thing I request is that someone confirm this by running C2_W4_Assignment on their personal machine.

Second, please suggest what exactly we have to do to run this assignment with nltk 3.6.5, the latest version. Different nltk versions are clearly not backward compatible, but there must be a way to process things correctly. With 60K+ tokens, comparing the generated tokens between the two versions by hand would be a laborious process.

Thank you in advance.

You're right about the difference in vocab sizes.
Since Coursera allows the labs to be downloaded, it would be good to list all of a lab's dependencies in a requirements.txt file so that people can create and use their own virtual environments. @Community-Team to look into this.
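For example, a minimal requirements.txt for this lab might look like the following (the nltk pin matches what the server reportedly uses; the other entries are assumptions and would need to be read off the actual lab environment):

```
# requirements.txt (sketch -- verify versions against the lab environment)
nltk==3.5
numpy
matplotlib
```

With such a file, `pip install -r requirements.txt` inside a fresh virtual environment would reproduce the grading environment closely enough for the unit tests.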

All tests are based on a particular environment setup, so you can still run your code on nltk 3.6.5, but the tests are going to fail until they are updated.

If you have nltk version 3.6.5, then to run this assignment and pass all tests in cell #23, add the following lines to cell #3 after reading the data from the shakespeare.txt file:

data = data.replace("stol'n", "stol ' n")
data = data.replace("Ev'n", "Ev ' n")
data = data.replace("fall'n", "fall ' n")

Compared with nltk 3.5 (on the Coursera server), this gives Number of tokens = 60992, which is 4 fewer than on the Coursera server. But the final results will be the same, since the vocabulary size is the same (= 5778).

Between nltk versions 3.5 and 3.6.5, these 3 words are tokenized differently: in nltk 3.5 each is broken into 3 distinct tokens, whereas in nltk 3.6.5 each yields a single token. I feel what nltk 3.6.5 does is correct. However, the isalpha() check applied next eliminates those single tokens altogether.
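This can be checked with a quick stdlib-only snippet: an embedded apostrophe makes isalpha() return False, so the single-token form is discarded entirely, while the split form keeps its alphabetic pieces (here a plain split() stands in for the 3-token output of nltk 3.5):

```python
# Single token, as produced by nltk 3.6.5: dropped by the isalpha() filter.
print("stol'n".isalpha())  # False

# Three tokens, as produced by nltk 3.5 (simulated via the replace() workaround):
tokens = "stol ' n".split()
kept = [t.lower() for t in tokens if t.isalpha()]
print(kept)  # ['stol', 'n']
```

So under nltk 3.5 the pieces "stol" and "n" survive into the vocabulary, while under nltk 3.6.5 the whole word vanishes, which accounts for the smaller token and vocabulary counts.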

So I feel the function w4_unittest.test_gradient_descent(…) should be updated with the new expected counts.



Thank you for letting us know that nltk == 3.5.* works for the unit tests.