C3W1_Practice_Assignment

Hello,

I’m having trouble getting the unit tests to pass on my build_vocabulary function. I put in some print statements to see what was going on. Here is my output:

corpus: [['a']]
word is: a
vocab value: 1
Output does not match with expected values. Maybe you can check the value you are using for unk_token variable. Also, try to avoid using the global dictionary Vocab.
Expected: {'': 0, '[UNK]': 1, 'a': 2}.
Got: {'': 0, '[UNK]': 1, 'a': 1}.
corpus: [['a', 'aa'], ['a', 'ab'], ['ccc']]
word is: a
vocab value: 1
word is: aa
vocab value: 1
word is: a
vocab value: 2
word is: ab
vocab value: 1
word is: ccc
vocab value: 1
Output does not match with expected values. Maybe you can check the value you are using for unk_token variable. Also, try to avoid using the global dictionary Vocab.
Expected: {'': 0, '[UNK]': 1, 'a': 2, 'aa': 3, 'ab': 4, 'ccc': 5}.
Got: {'': 0, '[UNK]': 1, 'a': 2, 'aa': 1, 'ab': 1, 'ccc': 1}.
2 Tests passed
2 Tests failed

From this it looks to me like my output is correct and the expected output is wrong. What am I misunderstanding?

Thanks,
Katrina

Hi @Katrina_Montinola

I suspect you’re using the global variable train_x inside your function instead of the function’s corpus parameter.

Isn’t that the case?

Thanks for the reply. Unfortunately, not. As you can see, I print out the corpus to make sure. What exactly does this message mean: “Maybe you can check the value you are using for unk_token variable. Also, try to avoid using the global dictionary Vocab.” I don’t see where the unk_token is used in the code and I don’t even know what the global dictionary Vocab is.

I know we aren’t supposed to put our code in the messages, but this is a Practice Assignment so I hope it’s ok. Can someone tell me what I’m doing wrong in this code?
Thanks very much! I’m an experienced C coder from a million years ago, but new to Python and AI/ML.

GRADED FUNCTION: build_vocabulary

def build_vocabulary(corpus):
    '''Function that builds a vocabulary from the given corpus
    Input:
        - corpus (list): the corpus
    Output:
        - vocab (dict): Dictionary of all the words in the corpus.
                The keys are the words and the values are integers.
    '''

    # The vocabulary includes special tokens like padding token and token for unknown words
    # Keys are words and values are distinct integers (increasing by one from 0)
    vocab = {'': 0, '[UNK]': 1}

    ### START CODE HERE ###
    print('corpus: ', corpus)
    # For each tweet in the training set
    for tweet in corpus:
        #print('tweet: ', tweet)
        # For each word in the tweet
        for word in tweet:
            #print('word: ', word)
            # If the word is not in vocabulary yet, add it to vocabulary
            if word not in vocab:
                vocab.setdefault(word, 1)
            else:
                vocab[word] += 1
            if word in ["a", "ab", "aa", "ccc"]:
                print('word is: ', word)
                print('vocab value: ', vocab[word])

    ### END CODE HERE ###

    return vocab

vocab = build_vocabulary(train_x)
num_words = len(vocab)

print(f"Vocabulary contains {num_words} words\n")
print(vocab)

Hi @Katrina_Montinola

Ok, you’re overthinking a bit. Here is what the docstring says:

'''Function that builds a vocabulary from the given corpus
Input: 
    - corpus (list): the corpus
Output:
    - vocab (dict): Dictionary of all the words in the corpus.
            The keys are the words and the values are integers.
'''

It’s not very specific, but the expected output should give you a hint about what is wanted:

The dictionary Vocab will look like this:

{'': 0,
'[UNK]': 1,
'followfriday': 2,
'top': 3,
'engage': 4,

In other words, when you encounter a word for the first time, you assign it the next unused index (one simple way of getting that is len(vocab), since the indices start at 0 and increase by one).

What your code does is different: it assigns the value 1 to every new word, and if the word is already in the dictionary it increments that word’s value. In effect you are counting occurrences, so the resulting values are neither increasing nor unique.
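To make the difference concrete, here is a minimal sketch of the index-assignment idea. It is only one possible way to write it, not the official solution; the name build_vocabulary_sketch is made up for illustration, and the starting dictionary is copied from the assignment’s starter code.

```python
def build_vocabulary_sketch(corpus):
    # Hypothetical helper name; starts from the two special tokens
    # given in the assignment's starter code (padding and unknown).
    vocab = {'': 0, '[UNK]': 1}
    for tweet in corpus:
        for word in tweet:
            if word not in vocab:
                # Indices run 0..len(vocab)-1, so len(vocab) is the next free one.
                vocab[word] = len(vocab)
            # A word that is already present keeps its index unchanged.
    return vocab


print(build_vocabulary_sketch([['a', 'aa'], ['a', 'ab'], ['ccc']]))
# {'': 0, '[UNK]': 1, 'a': 2, 'aa': 3, 'ab': 4, 'ccc': 5}
```

Note that the second 'a' does nothing here: a repeated word is simply skipped, which is why every value stays unique.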

I hope that clears the confusion.
Cheers

Ah! Thank you very much!!! I guess I misunderstood what Vocab should contain. Every word in the dictionary should just be assigned the next integer, not the number of times the word appears in the corpus.

Regards,
Katrina