I’ve been trying to convert tweets into tensors for an NLP project I’m working on. I’ve written a function that tokenizes the tweet, removes stop words and punctuation, applies stemming, and then converts each word into its unique integer ID from a vocabulary dictionary. If a word is not in the dictionary, it should be replaced with the ID of an unknown token ‘UNK’.
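To be concrete, this is the kind of lookup I mean for the last step. It's only a minimal sketch; the function name and the `'__UNK__'` marker are placeholders, not necessarily what the assignment uses:

```python
# Minimal sketch of the word-to-ID conversion: look up each processed word
# in the vocabulary and fall back to the unknown-token ID when it is missing.
# '__UNK__' is a placeholder; the assignment's actual unknown marker may differ.
def words_to_ids(words, vocab, unk_token='__UNK__'):
    unk_id = vocab[unk_token]
    return [vocab.get(word, unk_id) for word in words]
```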
Here is my attempt, step by step:
- Tokenization: the input tweet is first converted to lowercase using `tweet.lower()`, and then tokenized into individual words using `word_tokenize(tweet)`.
- Stop Words and Punctuation: the list of English stop words is obtained using `stopwords.words('english')`, and the set of punctuation symbols is obtained using `set(string.punctuation)`.
- Removed Words: a combined list of words to be removed is created by adding the stop words and punctuation symbols together using list comprehensions.
- Stemming: a `PorterStemmer` object is instantiated for stemming words in the tweet.
- Processing the Tweet: the code iterates through each word in the tokenized tweet using a for loop. Each word is stemmed before it’s checked against the removed-words list; if the current word is not in that list, it is added to the `word_l` list. Additionally, there’s a special case for emoticons such as “:)”: if the current word is a colon and the next word is a closing parenthesis, the two characters are combined and added to `word_l` as a single element. (A sketch of these steps follows this list.)
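To make the list above concrete, here is a minimal, self-contained sketch of the processing steps as I described them. It is not my exact submission: I've used a while loop here instead of a for loop so the “:)” lookahead is easier to show, and names like `process_tweet` and `word_l` are just the ones from my description.

```python
import string

from nltk.corpus import stopwords          # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize    # requires nltk.download('punkt')


def process_tweet(tweet):
    """Sketch of the cleaning steps described above (not my exact submission)."""
    stemmer = PorterStemmer()
    # Words to drop: English stop words plus punctuation symbols.
    removed_words = stopwords.words('english') + list(string.punctuation)

    tokens = word_tokenize(tweet.lower())
    word_l = []
    i = 0
    while i < len(tokens):
        # Special case: keep ":)" together as one token instead of letting the
        # colon and parenthesis be dropped individually as punctuation.
        if tokens[i] == ':' and i + 1 < len(tokens) and tokens[i + 1] == ')':
            word_l.append(':)')
            i += 2
            continue
        # Stem first, then check against the removal list (as described above).
        word = stemmer.stem(tokens[i])
        if word not in removed_words:
            word_l.append(word)
        i += 1
    return word_l
```

The while loop is purely so the one-token lookahead for “:)” is easy to express; otherwise it follows the same order of operations as my description.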
However, I’m running into some issues: the grader reports that my outputs do not match the expected values.
For each failing test, the grader prints: “Output does not match with expected values. Maybe you can check the value you are using for unk_token variable. Also, try to avoid using the global dictionary Vocab.”

The expected versus actual outputs are:

- Expected: [444, 2, 304, 567, 56, 9]
  Got: [2, 444, 2, 304, 2, 56, 9]
- Expected: [444, 3, 304, 567, 56, 9]
  Got: [3, 444, 3, 304, 3, 56, 9]
- Expected: [60, 2992, 2, 22, 236, 1292, 45, 1354, 118]
  Got: [2, 60, 2992, 2, 22, 236, 1292, 45, 1354]
- Expected: [-1, -10, 2, -22, 236, 1292, 45, -4531, 118]
  Got: [2, -1, -10, 2, -22, 236, 1292, 45, -4531]

4 Tests passed
4 Tests failed