C1_W1_Assignment's word frequencies

In Course 1, week 1’s assignment notebook C1_W1_Assignment, we are asked to extract features for each tweet by implementing the extract_features function. One of the inputs is the freqs dictionary, which should have keys such as (word, 1.0) for positive and (word, 0.0) for negative frequencies. Do we expect all words in the keys to be lower-case?

I’m asking because when I implemented the extract_features function and looked up word frequencies using keys like (word, 1.0), it gave the expected result (and all the following cells worked correctly). However, when I looked up keys like (word.lower(), 1.0), which adds a step converting each word to lower-case, it led to small differences between the calculated and expected costs when running gradient descent a few cells below.

I thought all the words should already be lower-case, so the additional .lower() call shouldn’t have an effect, but it seems that’s not the case? And looking at the function process_tweet, I don’t see anything there that ensures everything is in lower-case.

Just want to confirm that I’m not missing anything. The notebook instructions do mention that we should be careful about letter case, which is why I added the .lower() call on my first try, but that didn’t work for me.

Thanks!

Hi @Kuang-Han_Huang

This is a good question and if I remember correctly the issue is related to smiley faces. For example, :P and :p or :D and :d would become the same “word” and you would have slightly different behavior.
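To make that concrete, here is a minimal sketch with a made-up frequency dictionary. The counts and the `lookup` helper are hypothetical and only for illustration; I have not checked which emoticons actually appear in the assignment's freqs.

```python
# Toy frequency dictionary with invented counts, keyed by (token, label)
# in the same shape as the assignment's freqs.
freqs = {
    ("happy", 1.0): 3,
    (":D", 1.0): 5,
    (":P", 1.0): 2,
    (":p", 1.0): 1,
}

def lookup(token, label, lowercase=False):
    """Hypothetical helper: fetch a count, optionally lower-casing the token first."""
    key = (token.lower() if lowercase else token, label)
    return freqs.get(key, 0)

# Exact-case lookup finds the entry stored for ":P".
exact = lookup(":P", 1.0)

# Lower-casing first turns ":P" into ":p", so a different entry is read.
lowered = lookup(":P", 1.0, lowercase=True)

# ":D" becomes ":d", which is not a key at all, so the count falls back to 0.
missing = lookup(":D", 1.0, lowercase=True)

print(exact, lowered, missing)
```

Ordinary words such as "happy" are unaffected, so only a handful of features would shift, which fits the small cost differences you saw.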

I could be wrong but I suspect this might be the answer to your question. In any case, it’s a good practice to question or check things like you did when doing tokenization.

Cheers

Ah yes smileys make sense as “words” that shouldn’t always be lower-case. Thanks!

Yes, the smileys sound like a great theory. Note that this is an experimental science: it would be pretty straightforward to instrument the code and figure out which values are changed by the addition of the “to lower” operation. :nerd_face:
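One possible way to do that instrumentation is sketched below. The function names and the toy dictionary are my own placeholders, not part of the notebook; in the notebook you would pass in the real freqs instead of `toy_freqs`.

```python
def case_sensitive_keys(freqs):
    """Return the keys whose token is changed by .lower()."""
    return sorted(key for key in freqs if key[0] != key[0].lower())

def lookup_diffs(freqs):
    """Map each key to (exact count, count after lower-casing) when they differ."""
    diffs = {}
    for (token, label), count in freqs.items():
        lowered_count = freqs.get((token.lower(), label), 0)
        if lowered_count != count:
            diffs[(token, label)] = (count, lowered_count)
    return diffs

# Invented example data in the same (token, label) -> count shape.
toy_freqs = {
    ("happy", 1.0): 3,
    (":D", 1.0): 5,
    (":P", 0.0): 2,
    (":p", 0.0): 1,
}

changed_keys = case_sensitive_keys(toy_freqs)
diffs = lookup_diffs(toy_freqs)

for key in changed_keys:
    print(key, diffs.get(key))
```

If the smiley theory holds, the keys this prints for the real freqs should be mostly emoticons.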

In the lectures I think they do discuss how the words are standardized by stemming and converting to lower case and all that, but it’s been a long time since I watched them and I honestly forget what they say about smileys. :grin: