Adding start sequence tags and perplexity calculation

I completed my assignment for week 3, but I could not understand why there are the following differences between the lecture and the assignment.

  1. UNQ_C8 GRADED FUNCTION: count_n_grams pads each sentence with n start tokens (the length of the n-gram) instead of the n-1 start tokens mentioned in the lecture. The reason given in the comments is not clear to me.

  2. UNQ_C10 GRADED FUNCTION: calculate_perplexity sets N = len(sentence), which includes the start tokens and the end token. However, the lecture says that N should not include the start token. Why does the assignment not follow this?

Hi @Ritu_Pande

Regarding point 1, there’s an explanation:

Take a look at the ('<s>', '<s>') element in the bi-gram dictionary. Although a bi-gram only requires one start token, as in the element ('<s>', 'i'), the ('<s>', '<s>') element is useful when computing probabilities with tri-grams (its count is used as the denominator).
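To make the role of the ('<s>', '<s>') element concrete, here is a minimal sketch. It is not the assignment's code: the function name `count_n_grams_sketch`, the `<e>` end token, and the two toy sentences are assumptions for illustration, and no smoothing is applied.

```python
from collections import Counter

def count_n_grams_sketch(sentences, n, start_token="<s>", end_token="<e>"):
    """Count n-grams after padding each sentence with n start tokens
    and one end token (hypothetical helper, not the graded function)."""
    counts = Counter()
    for sentence in sentences:
        padded = [start_token] * n + list(sentence) + [end_token]
        # Slide a window of length n over the padded sentence.
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts

# Toy data, assumed for illustration only.
sentences = [["i", "like", "a", "cat"],
             ["this", "dog", "is", "like", "a", "cat"]]

bigram_counts = count_n_grams_sketch(sentences, 2)
trigram_counts = count_n_grams_sketch(sentences, 3)

# Unsmoothed estimate of P("i" | "<s>", "<s>"):
# the tri-gram count is divided by the count of its two-token prefix,
# which is exactly the ('<s>', '<s>') entry of the bi-gram dictionary.
prob = trigram_counts[("<s>", "<s>", "i")] / bigram_counts[("<s>", "<s>")]
print(prob)
```

With only n-1 = 1 start token in the bi-gram counts, the ('<s>', '<s>') key would be missing and the first-word tri-gram probability would have no denominator.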

In simple words, the reason is practical: it makes the implementation of the rest of the assignment easier.

Regarding point 2, I’m not sure why, but if I had to guess, it is to avoid overcomplicating the assignment.

In general, perplexity scores should not include start tokens or any other special tokens, but accounting for that might be too complicated for learners to complete the assignment.
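To show how much the choice of N matters, here is a small sketch. The per-token probabilities are made up, and `perplexity_from_probs` is a hypothetical helper rather than the assignment's calculate_perplexity; it only compares the two ways of counting N for the same sentence under a bi-gram model.

```python
import math

def perplexity_from_probs(token_probs, n_tokens):
    """Perplexity = (product of 1/p) ** (1/N), computed in log space.
    n_tokens is the N used for normalization (hypothetical helper)."""
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / n_tokens)

# Assumed conditional probabilities for 4 words plus the end token.
token_probs = [0.5, 0.25, 0.5, 0.25, 0.5]
n_start_tokens = 1  # one start token for a bi-gram model (assumption)

# Lecture convention: N counts the predicted tokens, not the start token.
ppl_lecture = perplexity_from_probs(token_probs, len(token_probs))

# N = length of the padded sentence, start token included.
ppl_with_start = perplexity_from_probs(token_probs,
                                       len(token_probs) + n_start_tokens)

print(ppl_lecture, ppl_with_start)
```

Because the product of probabilities is the same in both cases, a larger N only shrinks the exponent, so counting start tokens gives a somewhat lower perplexity. The ranking of models evaluated with the same convention is not affected.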

Cheers

Thank you for the explanations

Hi arvyzukai, I am not sure whether it is possible, but if it is, could you ask the course creators to add a rationale for point 2 in the assignment comments? It made me question my understanding of perplexity calculations :slight_smile: and might have the same effect on other learners as well.

Hi @Ritu_Pande

Yes, sure, I think it should not be a problem to include a sentence mentioning that. Excellent questions, by the way!

Cheers