I understand that a mask is used to get rid of pad tokens and keep only the actual predictions, so the number of correct predictions should be `np.sum(outputs * mask == labels * mask)`. But what should the total number of predictions be? I tried `np.sum(mask)`, but got a wrong accuracy, larger than 100%.

Printing the counts shows that the total number of predictions is smaller than the number of correct predictions, so using `np.sum(mask)` seems to be wrong. But why? What should the correct total be?

My code, FYI:

```
import numpy as np

# outputs, labels: assumed to be integer arrays of the same shape; pad: the padding token id
mask = (labels != pad)
n_correct = np.sum(outputs * mask == labels * mask)
n_prediction = np.sum(mask)
print("no. of correct predictions:", n_correct)
print("total actual predictions:", n_prediction)
accuracy = n_correct / n_prediction
```
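
For reference, here is a minimal reproduction with made-up data (the pad id `-1` and both toy arrays are assumptions for illustration, not taken from my real model). It suggests the padded positions may be the culprit: after multiplying by the mask, both sides are 0 there, so they compare as equal. Comparing first and applying the mask afterwards is one possible alternative, shown only as a sketch:

```python
import numpy as np

# Hypothetical toy data: the pad id and both arrays are made up for illustration.
pad = -1
labels = np.array([[1, 2, pad, pad],
                   [3, pad, pad, pad]])
outputs = np.array([[1, 0, 2, 2],
                    [3, 1, 1, 1]])

mask = (labels != pad)  # True at real tokens, False at padding

# Count as in the code above: at padded positions both products are 0,
# so every pad slot is counted as a match.
n_correct_question = np.sum(outputs * mask == labels * mask)
n_prediction = np.sum(mask)

# Possible alternative: compare first, then keep only non-pad positions.
n_correct_masked = np.sum((outputs == labels) & mask)

print("correct (as in the question):", n_correct_question)
print("correct (compare, then mask):", n_correct_masked)
print("total actual predictions:", n_prediction)
print("accuracy (compare, then mask):", n_correct_masked / n_prediction)
```

With this toy data the first count is larger than `np.sum(mask)`, which matches the symptom I am seeing, while the second count cannot exceed it.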