Question on calculating perplexity (C3W2)

Hi @Terwiliger

I think you understand that correctly, but just in case, here is how to calculate log_p:

  • preds is of shape (32, 64, 256) - (32 lines of text, 64 characters, the (log) probability of each character)
  • target shape is (32, 64) - (32 lines of text, 64 characters)
  • with tl.one_hot you temporarily reshape the target to (32, 64, 256) - (32 lines of text, 64 characters, a 1 at the position of the actual character and 0s everywhere else)
  • you element-wise multiply preds with the temporarily reshaped target - this way you keep only the (log) probabilities of the actual target characters (and ignore all the others)
  • you sum over the last dimension to get back to (32, 64) (the shape of log_p) - this gives you the model's (log) probabilities for the actual target characters (see the sketch right after this list)
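For illustration, here is a small plain-NumPy sketch of that multiply-and-sum trick. The toy data, variable names, and the padding id of 0 are my own assumptions for the demo; in the assignment you would use tl.one_hot and trax's fastmath numpy instead:

```python
import numpy as np

batch, seq_len, vocab = 32, 64, 256

# Toy data with the shapes discussed above (random, for illustration only).
logits = np.random.randn(batch, seq_len, vocab)
preds = logits - np.log(np.sum(np.exp(logits), axis=-1, keepdims=True))  # log-softmax, (32, 64, 256)
target = np.random.randint(1, vocab, size=(batch, seq_len))              # (32, 64)
target[:, -5:] = 0  # pretend the last 5 positions of each line are padding (id 0, an assumption)

# One-hot encode the targets: (32, 64) -> (32, 64, 256).
# (In the assignment this is tl.one_hot(target, preds.shape[-1]).)
one_hot_target = np.eye(vocab)[target]

# Element-wise multiply and sum over the vocabulary axis: only the
# log probability of the actual target character survives at each position.
log_p = np.sum(preds * one_hot_target, axis=-1)  # (32, 64)
```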

Yes, after multiplying log_p with non_pad you get rid of the padding. Loosely speaking, in the last dimension (the 64 characters) you set the values at the trailing padded positions to 0.
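Continuing the sketch above, the masking could look like this (again assuming the padding token has id 0):

```python
# non_pad is 1.0 for real characters and 0.0 for padded positions.
non_pad = 1.0 - np.equal(target, 0).astype(np.float32)  # (32, 64)

# Zero out the log probabilities at the padded positions.
log_p = log_p * non_pad                                  # (32, 64)
```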

  1. Because np.sum() without an axis argument would result in a scalar - a single value - but you need a sum over an axis (note the hint: # Remember to set the axis properly when summing up.). In other words, you need the sum over the last axis - over the characters in each line, not over all characters in all lines. So the “numerator” should be the sum of non-padded (log) probabilities: a 1D tensor of length 32 holding, for each line, the sum of log_p over the last axis.

  2. 32 would be the completely wrong number; 64 would be closer, but it does not account for the padded positions. Your “denominator” needs to be the number of non-padded characters in each line. To get it, sum non_pad (a 2D tensor of shape 32x64) over the last axis, which gives the number of non-padded characters in each line (a 1D tensor of length 32).

  3. When you element-wise divide the “numerator” by the “denominator” (in quotes because they are 1D tensors), you get the average log probability for each of the 32 lines (a 1D tensor of length 32) - see the sketch below.
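In the plain-NumPy sketch from above, steps 1-3 would then look roughly like this (the names are mine, not the assignment's):

```python
# "Numerator": per-line sum of the masked log probabilities -> shape (32,)
log_p_sum = np.sum(log_p, axis=-1)

# "Denominator": number of non-padded characters per line -> shape (32,)
non_pad_count = np.sum(non_pad, axis=-1)

# Average log probability per line -> shape (32,)
log_p_per_line = log_p_sum / non_pad_count
```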

Lastly, you take the mean over these lines (the 1D tensor of length 32 reduces to a scalar) - this was the 1/N part. Since the return statement negates the result (-log_ppx), this scalar is the value of your log perplexity.
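To tie it all together, here is the whole computation as one function - again only a plain-NumPy sketch with my own names and a pad_id=0 assumption; in the assignment you would work with trax's fastmath numpy and tl.one_hot:

```python
import numpy as np

def log_perplexity(preds, target, pad_id=0):
    """Plain-NumPy sketch of the computation described above.

    preds:  (batch, seq_len, vocab) log probabilities
    target: (batch, seq_len) integer character ids, pad_id marks padding
    """
    one_hot_target = np.eye(preds.shape[-1])[target]             # (batch, seq_len, vocab)
    log_p = np.sum(preds * one_hot_target, axis=-1)              # (batch, seq_len)
    non_pad = 1.0 - np.equal(target, pad_id).astype(np.float32)  # 1.0 where not padding
    log_p = log_p * non_pad                                      # zero out padded positions
    log_ppx = np.sum(log_p, axis=-1) / np.sum(non_pad, axis=-1)  # per-line average
    log_ppx = np.mean(log_ppx)                                   # the 1/N part
    return -log_ppx                                              # negate -> log perplexity

# With the toy preds/target from the first sketch:
# print(log_perplexity(preds, target))
```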

I hope I wasn’t too detailed with my response :slight_smile:
Cheers
