C2_W3_Assignment: Exercise 9 - estimate_probability

  • Link to the Programming Assignment: Autocomplete

Description:

Hello everyone,

I’m currently working through Exercise 9 - estimate_probability, and I noticed something a bit confusing about how the vocabulary size, |V|, is calculated. The code snippet provided for testing computes |V| with len(unique_words), which doesn’t include the start <s> and end <e> tokens. My understanding is that these tokens should also be part of our vocabulary. Including them would change the expected estimated probability P(cat | a) from the given 0.3333 to 0.2727.

Here’s the code snippet for reference:

# test your code
sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)

tmp_prob = estimate_probability("cat", ["a"], unigram_counts, bigram_counts, len(unique_words), k=1)

print(f"The estimated probability of word 'cat' given the previous n-gram 'a' is: {tmp_prob:.4f}")

Additionally, later in the notebook, the estimated probability of P(cat | a) is indeed 0.2727 when using the probability matrix. Could there be a typo in Exercise 9, or am I missing something in how the vocabulary size should be calculated?

[Image: probability matrix]
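To make the discrepancy concrete, here is a small self-contained check of the add-k arithmetic (it re-derives the counts directly rather than calling the notebook’s count_n_grams / estimate_probability, so it is only illustrative):

# Self-contained check of the add-k estimate for P(cat | a).
# The exact number of start tokens used for framing does not affect these counts.
sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
framed = [['<s>'] + s + ['<e>'] for s in sentences]

# Counts needed for P(cat | a)
bigram_count = sum(1 for s in framed for pair in zip(s, s[1:]) if pair == ('a', 'cat'))
previous_count = sum(s.count('a') for s in framed)

k = 1
for vocab_size in (7, 9):  # 7 = unique words only, 9 = words plus <s> and <e>
    prob = (bigram_count + k) / (previous_count + k * vocab_size)
    print(f"|V| = {vocab_size}: P(cat | a) = {prob:.4f}")

# |V| = 7 gives 0.3333 (the value expected by the Exercise 9 test),
# |V| = 9 gives 0.2727 (the value seen later in the probability matrix).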

Thank you!

As I understand it, the start and end tokens are not part of the vocabulary, since they are used only as control characters for framing, and are not part of the language.

Hi @echohui

They are not required for the test case, whose purpose is to check your implementation of estimate_probability().

Your understanding is correct that they (and also the <unk> token) should be included in the “vocabulary” (the variable). But when estimating n-gram probabilities, the <s> special token should not be included as a possible prediction - it would never be predicted by the model, so the probability of predicting it (even after smoothing) should be 0.
In other words, this token is “involved” in calculating n-gram conditional probabilities (for example, P(cat|<s>)), so it should be part of the vocabulary. But it is never the token being predicted (for example, P(<s>|cat)).
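As an illustration of that last point, here is a minimal hypothetical sketch of how a word-suggestion step could keep <s> in the vocabulary (it matters for |V| and for conditioning on contexts like P(cat|<s>)) while never proposing it as the next word; rank_next_words and estimate_fn are placeholder names, not functions from the notebook:

# Hypothetical sketch: <s> stays in the vocabulary, but is never a candidate prediction.
def rank_next_words(previous_ngram, vocabulary, estimate_fn):
    # estimate_fn(word, previous_ngram) is assumed to return a smoothed
    # conditional probability, e.g. one computed like estimate_probability().
    candidates = [w for w in vocabulary if w != '<s>']  # <s> is never predicted
    return sorted(candidates, key=lambda w: estimate_fn(w, previous_ngram), reverse=True)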

Usually (almost always) they are part of the “vocabulary” (the variable that the models work with), but, as you say, mainly for control purposes (for example, so that ChatGPT knows when to stop generating; there is always a probability associated with <e>).
As for being part of the “language”, that is a more philosophical question, but I would argue that they are part of the language too. Our (human) “vocabulary” is not limited to the words we speak: we also have body language, pauses, etc., and these can be modeled or “represented” symbolically. For example, when someone’s sentence ends with a long pause, that usually means they have generated the “<e>” token and now it is our turn to speak.

Cheers

Indeed. Language has a lot of complex aspects.