C3 Assignment 3 E4 Problem with understanding evaluate_prediction


I am stuck on figuring out accuracy in step #3; I don’t understand how the masking is used. I am looking at exercise 5 in assignment 3. I use mask = np.where(label != pad, ) to get the mask, and I am not sure why a different approach is used in assignment 3, exercise 5.

Assuming I am creating the mask properly, I don’t understand how to apply it so I can compare the labels with the predictions. I just need a simple explanation; there must be something simple I am missing.

lab id whacbnpw



Each training / evaluation batch has shape (num_examples, max_len), where max_len is the length of the longest sentence in the batch.
Sentences shorter than max_len get padded to fit the batch.

The model prediction has shape (num_examples, max_len, prob_per_ner_class).
When you compute outputs (the argmax over the class probabilities), it has shape (num_examples, max_len).
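As a quick sketch of those shapes (the sizes below are made up for illustration; the assignment's actual values differ):

```python
import numpy as np

# Hypothetical sizes, just for illustration
num_examples, max_len, num_classes = 2, 5, 17

# Model prediction: one probability per NER class at every position
pred = np.random.rand(num_examples, max_len, num_classes)

# argmax over the class axis collapses it to one predicted class id per position
outputs = np.argmax(pred, axis=-1)
print(pred.shape)     # (2, 5, 17)
print(outputs.shape)  # (2, 5)
```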


When making predictions, we only want to compare the positions that were not padded. This is what the mask is used for.


@balaji.ambresh I need help with how to do the masking. I am using mask = pred[:, :, 1] == pad but all the values come out False.


pred contains the probability of each class.
You have to use argmax to get the predicted classes, i.e. outputs.

Use labels to build the mask. To check whether each element in labels equals the padding token id, labels == pad is sufficient.
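A minimal sketch of building the mask from labels (the pad id 35180 is the value mentioned later in this thread; the token ids are invented):

```python
import numpy as np

pad = 35180  # padding token id, value quoted later in this thread
labels = np.array([[5, 2, 7, pad, pad]])  # made-up token ids

# True at real tokens, False at padding; flip to labels == pad if the
# assignment expects the mask to mark the padded positions instead
mask = labels != pad
print(mask)  # [[ True  True  True False False]]
```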


Hi @balaji.ambresh

Thanks for your help. I played around a bit and got the answer. What I didn’t quite get was how the mask interacts with the outputs and labels, and why the mask (and not the labels) is the denominator. I really have to look at it more so I fully understand.



@balaji.ambresh, then while computing accuracy, why is it wrong to compute it as np.sum(outputs * mask == labels * mask) / np.sum(mask)?

Why is the accuracy correct using [code removed -moderator]?

The numerator of the accuracy should compare only the non-padded elements, which we obtain by multiplying by the mask.



We don’t want to count the padded positions when calculating accuracy. Consider the example below: we want to compare only the first 3 positions, so the numerator should be 2 and not 5. Here’s a block of code that should clear things up for you:

import numpy as np

pad_id = 3000
labels = np.array([1, 2, 4, pad_id, pad_id, pad_id])
outputs = np.array([1, 2, 3, 10, 10, 10]) # the model outputs after argmax
mask = labels != pad_id
print(mask) # array([ True,  True,  True, False, False, False])
print(labels * mask) # array([1, 2, 4, 0, 0, 0])
print(outputs * mask) # array([1, 2, 3, 0, 0, 0])
print(np.sum(labels * mask == outputs * mask)) # 5 -- still wrong: padded positions match as 0 == 0

The reason why your approach works anyway is that the padding token will never be predicted by the model: the model predicts one of up to 17 labels, whereas the padding token id is 35180. So you are implicitly comparing only up to the padded lengths. Accuracy is in the range [0, 1]. Hope this clears up the purpose of the mask.
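For contrast, here is a sketch of the accuracy computed with the mask inside the numerator, using the same toy arrays as above:

```python
import numpy as np

pad_id = 3000
labels = np.array([1, 2, 4, pad_id, pad_id, pad_id])
outputs = np.array([1, 2, 3, 10, 10, 10])  # model outputs after argmax

mask = labels != pad_id

# Count matches only at the non-padded positions
correct = np.sum((outputs == labels) * mask)  # 2 (positions 0 and 1)
total = np.sum(mask)                          # 3 non-padded positions
print(correct / total)  # 0.666...
```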


Hello, I am completely lost on what the mask is doing. The assignment says the mask needs to be the same size as the output (I understand this, since it has to match the body that contains the text and padding). But when I tried mask = outputs == pad, it did not work, whereas mask = labels == pad worked and I passed all tests. How? The labels should not even have padding, right? Padding is only something we add before feeding the input to the model, so the target values we are comparing against should not be padded. Or am I missing something here? I am completely lost.


Adding @arvyzukai

Hi @zakharymg

The mask helps compute the “real” accuracy: it indicates where the pad characters are so that we do not count them.

Example sentence:
“The correct prediction <pad> <pad> <pad>”,

If we want to get the model’s accuracy, we need to compare only the words and not the <pad> tokens. So a 100% accurate model would predict 3/3 correct (not 6/6, nor 3/6, nor something else).
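In code form, a sketch of that sentence, using the words themselves as stand-ins for token ids:

```python
import numpy as np

labels  = np.array(["The", "correct", "prediction", "<pad>", "<pad>", "<pad>"])
outputs = np.array(["The", "correct", "prediction", "<pad>", "<pad>", "<pad>"])

# Only the 3 real words count toward the denominator
mask = labels != "<pad>"
acc = np.sum((outputs == labels) * mask) / np.sum(mask)
print(acc)  # 1.0 -> 3/3 correct, not 6/6
```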

For example:

  • bad model’s prediction (outputs after argmax):
    “The wrong words <pad> <pad> <pad>”
  • reasonably good model’s prediction (outputs after argmax):
    “This is correct sentence <pad> <pad>”
  • True labels:
    “This is the correct sentence <pad>”

I think this trivial example shows why you should count the pads in the labels but not in the outputs if you want the accuracy: the 100% accurate model would get 5/5 correct, and the reasonably good model would get 4/5 correct (but not 4/4).

Padding is needed for mini-batch processing. If you want to feed the model more than one sentence at a time, the input needs to be structured as a matrix, where every position holds a value (either an actual token or a <pad> value). This is a requirement for these types of models to work with mini-batch processing: you cannot do the matrix multiplication (inputs x weights) if the inputs are not a proper matrix. 🙂
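A sketch of why padding produces a rectangular batch (the pad id 0 and the token ids here are arbitrary):

```python
import numpy as np

pad = 0  # arbitrary pad id for this sketch
sentences = [[4, 7, 9], [3, 5], [8, 1, 2, 6]]  # token ids, unequal lengths

# Pad every sentence to the length of the longest one
max_len = max(len(s) for s in sentences)
batch = np.array([s + [pad] * (max_len - len(s)) for s in sentences])
print(batch.shape)  # (3, 4) -- now a proper matrix, ready for inputs x weights
print(batch)
```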

