Exercise 4: compute accuracy

There are two ways to get the accuracy of the predictions (after taking care of padding):

  1. Compute the accuracy of each sentence separately, then take the mean of those per-sentence accuracies.
  2. Treat the entire batch of sentences as one big set of tokens and calculate the accuracy of that complete set in a single operation.

The two might give different numerical results. Which one is correct? I think the grader considers the second one correct, but can someone clarify why that is the case?


For each sentence (after taking care of padding), you need to find the number of correctly predicted tags and divide this number by the total number of unpadded tokens. This gives you the accuracy on a single sentence.

You can perform the above operation by iterating over each sentence or on the entire batch at once using vectorization. Both methods will give you the same result.
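To make the loop-versus-vectorization point concrete, here is a minimal NumPy sketch. The pad index `PAD_ID = 0`, the array names `labels` and `preds`, and the tag ids are all made up for illustration; they are not taken from the assignment. It only shows that the per-sentence accuracies come out the same either way.

```python
import numpy as np

PAD_ID = 0  # assumed pad index, chosen only for this illustration

# Hypothetical tag ids: 3 sentences padded to length 3.
labels = np.array([[2, 1, PAD_ID],
                   [3, 1, 2],
                   [1, 2, 3]])
preds = np.array([[2, 1, 1],
                  [3, 2, 1],
                  [1, 3, 3]])

# Loop version: handle one sentence at a time.
loop_acc = []
for p, y in zip(preds, labels):
    keep = y != PAD_ID  # unpadded positions of this sentence
    loop_acc.append((p[keep] == y[keep]).mean())
loop_acc = np.array(loop_acc)

# Vectorized version: same per-sentence numbers, no Python loop.
mask = labels != PAD_ID              # True where the token is real
correct = (preds == labels) & mask   # correct predictions on real tokens only
vec_acc = correct.sum(axis=1) / mask.sum(axis=1)

print(loop_acc)
print(vec_acc)
```

Note that the vectorized division assumes every sentence has at least one unpadded token; an all-PAD row would divide by zero.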

I understand and it’s also the same as the second method that I mentioned in my original post. I’d like to know why (and if) the first method is incorrect. Let’s take a small example here:

matches = [[True, True, PAD],
           [True, False, False],
           [True, False, True]]

where True means the predicted label is the same as the target label and False means it is not. PAD marks the part of the sentence to ignore.

Now accuracy based on the two methods that we’re discussing:

  1. mean accuracy from individual sentence accuracy
    acc1 = 2/2
    acc2 = 1/3
    acc3 = 2/3
    final_acc = (acc1 + acc2 + acc3)/3 = 0.67

  2. accuracy of all the sentences as a whole
    final_acc = 5/8 = 0.625
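The arithmetic above can be checked with a short NumPy sketch. The `matches` and `mask` arrays encode the example by hand (the PAD slot is stored as False in `matches` and excluded through `mask`); the variable names are my own and not from the assignment.

```python
import numpy as np

# The example from this post; the PAD slot is a placeholder that the mask removes.
matches = np.array([[True, True, False],
                    [True, False, False],
                    [True, False, True]])
mask = np.array([[True, True, False],   # False marks the PAD position
                 [True, True, True],
                 [True, True, True]])

correct = matches & mask

# Method 1: accuracy per sentence, then the mean over sentences.
per_sentence = correct.sum(axis=1) / mask.sum(axis=1)
mean_of_sentences = per_sentence.mean()

# Method 2: pool every unpadded token in the batch.
pooled = correct.sum() / mask.sum()

print(round(mean_of_sentences, 3))  # 0.667
print(pooled)                       # 0.625
```

The gap comes from weighting: method 1 gives every sentence equal weight regardless of its length, while method 2 gives every unpadded token equal weight. The two agree whenever all sentences have the same number of unpadded tokens.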

I can think of computation speedup as one reason why we’d prefer the second option but from the POV of correctness, I’m still not clear why we’d prefer the second option.

The first method is correct and should be used to compute the accuracy on a single sentence as well as on a batch of sentences.

I think the second method is not the correct way to compute accuracy. I will discuss this with the team.


I am having problems with computing accuracy. I believe I have used argmax() properly. However, I am having problems building the mask. I’m assuming that when I build the mask, I should be checking the output against the pad? As part of the error, I’m getting:

Wrong output: Pad token is being considered in accuracy calculation. Make sure to apply the mask…

What should I be reading as reference? I tried masking like the previous class assignment. What am I missing?

Just in case, my lab ID is whacbnpw.


The PAD token will have a unique index. You need to remove the effect of padding while computing accuracy.

You can practice masking by manually generating a random array of numbers and then masking different numbers to see the results. You can use NumPy for this.
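As a starting point for that kind of practice, here is a small sketch. The helper name `mask_out` and the value range are hypothetical, picked just for the exercise; the idea is only to see which entries survive when a given number is masked.

```python
import numpy as np


def mask_out(values, ignored):
    """Return a boolean mask that is False wherever `values == ignored`,
    together with the 1-D array of entries that survive the mask."""
    mask = values != ignored
    return mask, values[mask]


rng = np.random.default_rng(0)               # fixed seed so runs are repeatable
values = rng.integers(0, 5, size=(3, 6))     # random integers in [0, 5)
print(values)

# Mask a different number each time and look at what is left.
for ignored in (0, 1, 2):
    mask, kept = mask_out(values, ignored)
    print(f"ignoring {ignored}: kept {kept.size} of {values.size} entries")
    print(mask)
```

Once this feels comfortable, the same kind of mask can be built from the array that contains the padding and used to restrict a comparison to the unmasked positions.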

Hi @SainiAnkit

I looked at the variables in a debugger. I believe I should be masking the labels, since I see padding there. Doing this gives the same values as the expected output. Still, I am very confused about how to compare the predictions against this mask. I’m looking at the previous assignment.