C2_W3_Assignment: Exercise 9 - estimate_probability

  • Link to the Programming Assignment: Autocomplete

Description:

Hello everyone,

I’m currently working through Exercise 9 - estimate_probability, and I noticed something a bit confusing about how the vocabulary size, |V|, is calculated. The code snippet provided for testing computes |V| with len(unique_words), which doesn’t include the start <s> and end <e> tokens. My understanding is that these tokens should also be part of our vocabulary. Including them would change the expected estimated probability P(cat | a) from the given 0.3333 to 0.2727.

Here’s the code snippet for reference:

# test your code
sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)

tmp_prob = estimate_probability("cat", ["a"], unigram_counts, bigram_counts, len(unique_words), k=1)

print(f"The estimated probability of word 'cat' given the previous n-gram 'a' is: {tmp_prob:.4f}")

Additionally, later in the notebook, the estimated probability of P(cat | a) is indeed 0.2727 when using the probability matrix. Could there be a typo in Exercise 9, or am I missing something in how the vocabulary size should be calculated?

[Image: probability matrix]
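To make the discrepancy concrete, here is a small self-contained check of the add-k arithmetic (it re-derives the counts directly rather than calling the notebook’s count_n_grams / estimate_probability, so it is only illustrative):

# Self-contained check of the add-k estimate for P(cat | a).
# The exact number of start tokens used for framing does not affect these counts.
sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
framed = [['<s>'] + s + ['<e>'] for s in sentences]

# Counts needed for P(cat | a)
bigram_count = sum(1 for s in framed for pair in zip(s, s[1:]) if pair == ('a', 'cat'))
previous_count = sum(s.count('a') for s in framed)

k = 1
for vocab_size in (7, 9):  # 7 = unique words only, 9 = words plus <s> and <e>
    prob = (bigram_count + k) / (previous_count + k * vocab_size)
    print(f"|V| = {vocab_size}: P(cat | a) = {prob:.4f}")

# |V| = 7 gives 0.3333 (the value expected by the Exercise 9 test),
# |V| = 9 gives 0.2727 (the value seen later in the probability matrix).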

Thank you!

As I understand it, the start and end tokens are not part of the vocabulary, since they are used only as control characters for framing, and are not part of the language.

Hi @echohui

They are not required for the test case, whose purpose is to check your implementation of estimate_probability().

Your understanding is correct that they (and also the <unk> token) should be included in the “vocabulary” (the variable). But when estimating n-gram probabilities, the <s> special token should not be included as a possible prediction - it would never be predicted by the model, so the probability of predicting it (even after smoothing) should be 0.
In other words, this token is “involved” in calculating n-gram conditional probabilities (for example, P(cat|<s>)), so it should be part of the vocabulary. But it is never the token being predicted (for example, P(<s>|cat)).
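As an illustration of that last point, here is a minimal hypothetical sketch of how a word-suggestion step could keep <s> in the vocabulary (it matters for |V| and for conditioning on contexts like P(cat|<s>)) while never proposing it as the next word; rank_next_words and estimate_fn are placeholder names, not functions from the notebook:

# Hypothetical sketch: <s> stays in the vocabulary, but is never a candidate prediction.
def rank_next_words(previous_ngram, vocabulary, estimate_fn):
    # estimate_fn(word, previous_ngram) is assumed to return a smoothed
    # conditional probability, e.g. one computed like estimate_probability().
    candidates = [w for w in vocabulary if w != '<s>']  # <s> is never predicted
    return sorted(candidates, key=lambda w: estimate_fn(w, previous_ngram), reverse=True)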

Usually (almost always) they are part of the “vocabulary” (the variable that the models work with), but, as you say, mainly for control purposes (for example, so that ChatGPT knows when to stop generating; there is always a probability associated with <e>).
As for being part of the “language”, that is a more philosophical question, but I would argue that they are part of the language too. Our (human) “vocabulary” is not limited to the words we speak: we also have body language, pauses, etc., and these can be modeled or “represented” symbolically. For example, when someone’s sentence ends with a long pause, that usually means they have generated the “<e>” token and now it is our turn to speak.

Cheers

Indeed. Language has a lot of complex aspects.