I’m really struggling to understand some components of calculating perplexity in UNQ_C5.

This is my understanding of what we should do -

Calculate log(p) by first projecting the 2nd dimension of target onto a 3rd dimension of 0s and 1s to match the shape of preds; then multiply elementwise along that last dimension so that only the prediction for the one-hot element remains

This should result in a 32x64 array, with 32 sentences, each with 64 word placeholders, and the value there being the pred value for the element at target

Strip out the padding with a 0/1 mask

Sum the log_p values for the words in each sentence, then multiply that array (32 elements) by -1/N

Take the mean of this.

Where this breaks down for me, I think, is that I don’t understand how to implement this line: log_ppx = np.sum(None, None) / np.sum(None, None)

Why wouldn’t we just do np.sum() / 32?

I may also be having some challenges understanding earlier steps of this.

I think you understand that correctly, but just in case, here is how to calculate log_p:

preds is of shape (32, 64, 256) - (32 lines of text, 64 characters, a log probability for each of the 256 possible characters)

target shape is (32, 64) - (32 lines of text, 64 characters)

With tl.one_hot you temporarily expand the target to (32, 64, 256) - (32 lines of text, 64 characters, a vector of length 256 with a 1 at the actual character and 0s everywhere else)

You elementwise multiply preds with the temporarily expanded target - this way you keep only the log probabilities for the actual target characters (and do not care about the others)

You sum over the last dimension to get back to (32, 64) (the shape of log_p) - this way you get the model’s log probabilities for the actual target characters
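The one-hot step above can be sketched in plain NumPy. This is only an illustration under assumptions: the shapes are shrunk to (2, 4, 5) in place of (32, 64, 256), the values are random, and np.eye(vocab)[target] stands in for the one-hot expansion that tl.one_hot performs in the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, length, vocab = 2, 4, 5  # small stand-ins for 32, 64, 256

# Fake model output: log-softmax of random logits, shape (batch, length, vocab).
logits = rng.normal(size=(batch, length, vocab))
preds = logits - np.log(np.sum(np.exp(logits), axis=-1, keepdims=True))

# Fake targets: one character id per position, shape (batch, length).
target = rng.integers(0, vocab, size=(batch, length))

# One-hot expansion: (batch, length) -> (batch, length, vocab).
one_hot = np.eye(vocab)[target]

# Keep only the log probability of the actual target character,
# then sum away the last axis to get back to (batch, length).
log_p = np.sum(preds * one_hot, axis=-1)

print(log_p.shape)
```

Each entry of log_p is the same value you would get by indexing preds directly with the target id at that position.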

Yes, after multiplying log_p with non_pad you get rid of the padding. Loosely speaking - along the last dimension (64) you set the values of the trailing padding characters to 0.
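A tiny sketch of the masking step, assuming the padding id is 0 (an assumption for this example, as are the made-up numbers):

```python
import numpy as np

# Two lines of 5 characters each; 0 is assumed to be the padding id.
target = np.array([[5, 3, 7, 0, 0],
                   [2, 9, 0, 0, 0]])

# Made-up log probabilities, including meaningless values at padded positions.
log_p = np.array([[-0.1, -0.2, -0.3, -4.0, -5.0],
                  [-0.5, -0.6, -3.0, -2.0, -1.0]])

# 1.0 where there is a real character, 0.0 where there is padding.
non_pad = 1.0 - np.equal(target, 0)

# Padded positions no longer contribute anything to later sums.
masked = log_p * non_pad

print(non_pad)
print(masked)
```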

Because np.sum() without an axis would result in a scalar - a single value, but you need a sum over an axis (note the hint: # Remember to set the axis properly when summing up.). In other words, you need the sum over the last axis - the characters in each line - and not over all characters in all lines. So the “numerator” should be the sum of the non-padded log probabilities (to be more detailed: a 1D tensor of length 32 holding the sum of non-padded log probabilities for each line, or concretely - the sum of log_p * non_pad over the last axis).
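To see the difference the axis makes, here is a small example with made-up, already masked values (2 lines of 5 characters in place of 32 lines of 64):

```python
import numpy as np

masked = np.array([[-0.1, -0.2, -0.3, 0.0, 0.0],
                   [-0.5, -0.6, 0.0, 0.0, 0.0]])

# No axis: everything collapses into one number for the whole batch.
total = np.sum(masked)

# Last axis: one sum per line, shape (2,).
per_line = np.sum(masked, axis=-1)

print(np.ndim(total), per_line.shape)
```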

32 would be entirely the wrong number. 64 would be closer, but 64 does not account for padded characters. Your “denominator” needs to be the number of non-padded characters in each line. So to get it you should sum non_pad (a 2D tensor, 32x64) over the last axis to get the number of non-padded characters in each line (a 1D tensor of length 32).

When you elementwise divide the “numerator” by the “denominator” (in quotes because they are 1D tensors) you get 32 average log probs, one for each line (a 1D tensor of length 32).

Lastly, you take the mean across these lines (the 1D tensor of length 32 reduces to a scalar); this is the 1/N part. Since the return statement negates it (-log_ppx), the returned value is your log perplexity.
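Putting the division and the mean together on a toy example (made-up numbers, 2 lines of 5 characters; the name naive is just for this illustration) shows why dividing by a fixed length gives a different answer:

```python
import numpy as np

non_pad = np.array([[1.0, 1.0, 1.0, 0.0, 0.0],
                    [1.0, 1.0, 0.0, 0.0, 0.0]])
masked_log_p = np.array([[-0.1, -0.2, -0.3, 0.0, 0.0],
                         [-0.5, -0.6, 0.0, 0.0, 0.0]])

numerator = np.sum(masked_log_p, axis=-1)  # summed log probs per line
denominator = np.sum(non_pad, axis=-1)     # real characters per line: 3 and 2

per_line_avg = numerator / denominator     # average log prob per line
log_ppx = np.mean(per_line_avg)            # mean across lines

# Dividing by the full line length (5) also counts the padding,
# which makes the model look better than it is.
naive = np.mean(numerator / masked_log_p.shape[-1])

print(-log_ppx, -naive)
```

With these numbers the per-line averages are -0.2 and -0.55, so the negated mean is 0.375, while the fixed-length version gives 0.17.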

I hope I wasn’t too detailed with my response
Cheers

That is so incredibly helpful! I didn’t even think about the denominator being the number of elements in that sentence, but now that you point it out it makes complete sense. Thank you!