The course materials define perplexity using M (number of sentences) in the exponent, but the quiz seems to expect the per-word version using N (total number of tokens). The per-word convention (dividing by N) is the one used in NLP textbooks such as Jurafsky & Martin's "Speech and Language Processing."
Would it be possible to clarify these? Thank you so much for your time!
Sorry for the delayed response, @Yujing_Wang1.
Could you share a link to the material you are referring to, so I can cross-check and get back to you?
Perplexity is defined as the exponentiated average negative log-likelihood of a sequence, normalized per token (N) rather than per sentence (M).
That said, a per-sentence convention can also appear in some settings, depending on how the text is chunked. For example, a question-answer transformer pipeline might treat each query-response pair as a single unit.
While formulas may differ across documentation, the standard NLP convention from Jurafsky & Martin uses the total token count (N) in the exponent, so that perplexity represents the geometric mean of the inverse per-token probability.
- Per-word convention (N tokens):
PP(W) = P(w_1, w_2, …, w_N)^(-1/N)
This computes the average branching factor per word, which is the standard measure when comparing models across texts of different lengths.
- Per-sentence convention (M sentences): if M is the number of sentences and each sentence is treated as a single unit, the formula looks different, but the underlying goal of measuring how well the model predicts individual tokens stays the same. This variant can make sense for Q&A-style output, where each query-to-response chunk is treated as one sentence-like unit.
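To make the difference between the two normalizations concrete, here is a minimal sketch in Python. The per-token probabilities below are hypothetical, chosen only for illustration; the point is that the same total log-likelihood gives very different numbers depending on whether you divide by N (tokens) or M (sentences).

```python
import math

def perplexity(token_log_probs, normalizer):
    """Exponentiated negative average log-likelihood.

    token_log_probs: natural-log probabilities of each token.
    normalizer: N (token count) for the per-word convention,
                or M (sentence count) for a per-sentence variant.
    """
    avg_neg_ll = -sum(token_log_probs) / normalizer
    return math.exp(avg_neg_ll)

# Hypothetical per-token probabilities for a 6-token, 2-sentence text.
probs = [0.2, 0.1, 0.25, 0.3, 0.15, 0.2]
log_probs = [math.log(p) for p in probs]

N = len(probs)  # 6 tokens
M = 2           # 2 sentences

pp_per_token = perplexity(log_probs, N)
pp_per_sentence = perplexity(log_probs, M)

# The per-token value is comparable across texts of different lengths;
# the per-sentence value grows with sentence length.
print(f"per-token PP:    {pp_per_token:.3f}")
print(f"per-sentence PP: {pp_per_sentence:.3f}")
```

Note that the per-token value equals the inverse geometric mean of the token probabilities, which is exactly the "average branching factor" interpretation above.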
If your question concerns something else, please share which part of the course section it refers to, and feel free to ask if you have any further doubts.
Regards,
Dr. Deepti