C3_W1_Assignment:Exercise 1

I’ve been trying to convert tweets into tensors for an NLP project I’m working on. I’ve written a function that tokenizes the tweet, removes stop words and punctuation, applies stemming, and then converts each word into its unique integer ID from a vocabulary dictionary. If a word is not in the dictionary, it should be replaced with the ID of an unknown token ‘UNK’.

Here is my attempt:

  1. Tokenization: The input tweet is first converted to lowercase using tweet.lower(), and then tokenized into individual words using word_tokenize(tweet).
  2. Stop Words and Punctuation: The list of English stop words is obtained using stopwords.words('english'), and the set of punctuation symbols is obtained using set(string.punctuation).
  3. Removed Words: A combined list of words to be removed is created by adding the stop words and punctuation symbols together using list comprehensions.
  4. Stemming: A PorterStemmer object is instantiated for stemming words in the tweet.
  5. Processing the Tweet: The code iterates through each word in the tokenized tweet using a for loop. Each word is stemmed before it’s checked against the removed-words list; if the current word is not in that list, it is added to the word_l list. Additionally, there’s a special case for emoticons such as “:)”. If the current word is a colon and the next word is a closing parenthesis, the two characters are combined and added to word_l as a single element.
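The steps above can be sketched roughly as follows. To keep the sketch runnable without NLTK downloads, a simple whitespace tokenizer, a tiny hardcoded stop-word list, and no stemming stand in for word_tokenize, stopwords.words('english'), and PorterStemmer — those stand-ins are my assumptions, not your actual code:

```python
import string

# Stand-ins for the NLTK pieces described above, so this runs anywhere:
# a tiny stop-word list replaces stopwords.words('english'), and
# whitespace splitting replaces word_tokenize; stemming is omitted.
STOP_WORDS = {"the", "a", "an", "is", "and", "i", "to"}
PUNCT = set(string.punctuation)
REMOVED = STOP_WORDS | PUNCT  # one combined set, O(1) membership tests


def process_tweet(tweet):
    tokens = tweet.lower().split()
    word_l = []
    i = 0
    while i < len(tokens):
        # Special case: an emoticon like ":)" split across two tokens
        # is recombined and kept as a single element.
        if tokens[i] == ":" and i + 1 < len(tokens) and tokens[i + 1] == ")":
            word_l.append(":)")
            i += 2
            continue
        # Keep only words that are neither stop words nor punctuation.
        if tokens[i] not in REMOVED:
            word_l.append(tokens[i])
        i += 1
    return word_l


print(process_tweet("i am happy : )"))  # → ['am', 'happy', ':)']
```

One detail worth noting: the usual pattern checks a word against the removed-words list *before* stemming it, since stemming can change a stop word so it no longer matches the list.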

However, I’m running into some issues with the outputs not matching the expected values:
Output does not match with expected values. Maybe you can check the value you are using for unk_token variable. Also, try to avoid using the global dictionary Vocab.
Expected: [444, 2, 304, 567, 56, 9].
Got: [2, 444, 2, 304, 2, 56, 9].
Output does not match with expected values. Maybe you can check the value you are using for unk_token variable. Also, try to avoid using the global dictionary Vocab.
Expected: [444, 3, 304, 567, 56, 9].
Got: [3, 444, 3, 304, 3, 56, 9].
Output does not match with expected values. Maybe you can check the value you are using for unk_token variable. Also, try to avoid using the global dictionary Vocab.
Expected: [60, 2992, 2, 22, 236, 1292, 45, 1354, 118].
Got: [2, 60, 2992, 2, 22, 236, 1292, 45, 1354].
Output does not match with expected values. Maybe you can check the value you are using for unk_token variable. Also, try to avoid using the global dictionary Vocab.
Expected: [-1, -10, 2, -22, 236, 1292, 45, -4531, 118].
Got: [2, -1, -10, 2, -22, 236, 1292, 45, -4531].
4 Tests passed
4 Tests failed

Note that you filed this under NLP Course 1 Week 1, but the title says Course 3 Week 1. I am not familiar at all with NLP C3, but what you show does not look anything like NLP C1 W1 Logistic Regression.

I know nothing about the problem you are working on, but there is a pattern to the errors you are making. Note that your answers agree with the “Expected” values, except that you insert an extra token as the first value in each of your generated results (and, where the lengths match, the final expected token is dropped). I hope that is a clue as to the nature of your error. It sounds like some variation of an “off by one” error.
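For reference, the word-to-ID step usually looks something like the sketch below, with made-up IDs. The vocab_dict argument and the '__UNK__' key are my assumptions, not the course's actual vocabulary; the point is that the dictionary is passed in rather than read from a global Vocab, and that exactly one ID is emitted per word, with nothing prepended or dropped:

```python
# Hedged sketch of mapping processed words to integer IDs.
# vocab_dict and the '__UNK__' key are illustrative assumptions.
def tweet_to_tensor(word_l, vocab_dict, unk_token="__UNK__"):
    unk_id = vocab_dict[unk_token]
    # One ID per word, in order: nothing extra at the front,
    # nothing dropped at the end.
    return [vocab_dict.get(word, unk_id) for word in word_l]


vocab = {"__UNK__": 2, "happi": 444, "merri": 304}  # toy vocabulary
print(tweet_to_tensor(["happi", "zzz", "merri"], vocab))  # → [444, 2, 304]
```

If your loop ever appends the UNK ID before the first real word, or stops one word early, you would see exactly the shifted pattern in your output above.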