I’ve been trying to convert tweets into tensors for an NLP project I’m working on. I’ve written a function that tokenizes the tweet, removes stop words and punctuation, applies stemming, and then converts each word into its unique integer ID from a vocabulary dictionary. If a word is not in the dictionary, it should be replaced with the ID of an unknown token ‘UNK’.
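To be concrete, this is the kind of lookup I mean for the last step. It's only a minimal sketch; the function name and the `'__UNK__'` marker are placeholders, not necessarily what the assignment uses:

```python
# Minimal sketch of the word-to-ID conversion: look up each processed word
# in the vocabulary and fall back to the unknown-token ID when it is missing.
# '__UNK__' is a placeholder; the assignment's actual unknown marker may differ.
def words_to_ids(words, vocab, unk_token='__UNK__'):
    unk_id = vocab[unk_token]
    return [vocab.get(word, unk_id) for word in words]
```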
Here is my attempt, step by step:
- Tokenization: the input tweet is first converted to lowercase using `tweet.lower()`, and then tokenized into individual words using `word_tokenize(tweet)`.
- Stop Words and Punctuation: the list of English stop words is obtained using `stopwords.words('english')`, and the set of punctuation symbols is obtained using `set(string.punctuation)`.
- Removed Words: a combined list of words to be removed is created by adding the stop words and punctuation symbols together using list comprehensions.
- Stemming: a `PorterStemmer` object is instantiated for stemming words in the tweet.
- Processing the Tweet: the code iterates through each word in the tokenized tweet using a for loop. Each word is stemmed before it’s checked against the removed-words list; if the current word is not in that list, it is added to the `word_l` list. Additionally, there’s a special case for emoticons such as “:)”: if the current word is a colon and the next word is a closing parenthesis, the two characters are combined and added to `word_l` as a single element. (A sketch of these steps follows this list.)
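To make the list above concrete, here is a minimal, self-contained sketch of the processing steps as I described them. It is not my exact submission: I've used a while loop here instead of a for loop so the “:)” lookahead is easier to show, and names like `process_tweet` and `word_l` are just the ones from my description.

```python
import string

from nltk.corpus import stopwords          # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize    # requires nltk.download('punkt')


def process_tweet(tweet):
    """Sketch of the cleaning steps described above (not my exact submission)."""
    stemmer = PorterStemmer()
    # Words to drop: English stop words plus punctuation symbols.
    removed_words = stopwords.words('english') + list(string.punctuation)

    tokens = word_tokenize(tweet.lower())
    word_l = []
    i = 0
    while i < len(tokens):
        # Special case: keep ":)" together as one token instead of letting the
        # colon and parenthesis be dropped individually as punctuation.
        if tokens[i] == ':' and i + 1 < len(tokens) and tokens[i + 1] == ')':
            word_l.append(':)')
            i += 2
            continue
        # Stem first, then check against the removal list (as described above).
        word = stemmer.stem(tokens[i])
        if word not in removed_words:
            word_l.append(word)
        i += 1
    return word_l
```

The while loop is purely so the one-token lookahead for “:)” is easy to express; otherwise it follows the same order of operations as my description.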
However, I’m running into some issues: the grader reports that my outputs do not match the expected values.
For each failing test, the grader prints: “Output does not match with expected values. Maybe you can check the value you are using for unk_token variable. Also, try to avoid using the global dictionary Vocab.”

The expected versus actual outputs are:

- Expected: [444, 2, 304, 567, 56, 9]
  Got: [2, 444, 2, 304, 2, 56, 9]
- Expected: [444, 3, 304, 567, 56, 9]
  Got: [3, 444, 3, 304, 3, 56, 9]
- Expected: [60, 2992, 2, 22, 236, 1292, 45, 1354, 118]
  Got: [2, 60, 2992, 2, 22, 236, 1292, 45, 1354]
- Expected: [-1, -10, 2, -22, 236, 1292, 45, -4531, 118]
  Got: [2, -1, -10, 2, -22, 236, 1292, 45, -4531]

4 Tests passed
4 Tests failed