# Practical advice on a single-value evaluation metric

Hi Fellow learners and Mentors,

As I make progress through the assignments, I am also building my own implementations of what I have learned so far. In Course 3, Professor Andrew Ng discussed evaluating an algorithm's performance using a single-value metric on the dev set.

Based on what I understood, I have written an implementation that computes the prediction accuracy and, in particular, the F1 score.

```python
import numpy as np

def prediction_accuracy_metric(predicted_labels, true_labels, num_examples):
    np.testing.assert_equal(predicted_labels.shape, true_labels.shape)
    # Threshold the predicted probabilities into hard 0/1 labels.
    # Note: this modifies predicted_labels in place.
    predicted_labels[predicted_labels > 0.5] = 1
    predicted_labels[predicted_labels <= 0.5] = 0
    true_positives = np.sum((predicted_labels == 1) & (true_labels == 1))
    false_positives = np.sum((predicted_labels == 1) & (true_labels == 0))
    false_negatives = np.sum((predicted_labels == 0) & (true_labels == 1))
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * (precision * recall) / (precision + recall)
    percentage_accuracy = np.sum(predicted_labels == true_labels) / num_examples
    return percentage_accuracy, f1
```

The following is an example output from one run of my implementation after 2,500 iterations of optimization.

Training set prediction accuracy: 0.9234449760765551, F1 score: 0.8840579710144927

Dev set prediction accuracy: 0.76, F1 score: 0.8181818181818182

When I compare the training set and the dev set in terms of prediction accuracy alone, I get the impression that I have a variance problem: I have overfitted the training set, so I should try larger values of the regularization parameter, or try dropout (which would be a less recommended approach compared to L2 regularization).
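(For reference, by the regularization parameter I mean the lambda in the usual L2 cost term; below is a minimal sketch with hypothetical names, not my actual code.)

```python
import numpy as np

def l2_regularized_cost(cross_entropy_cost, weight_matrices, lambd, m):
    """Add the L2 penalty (lambd / (2m)) * sum ||W||^2 to an existing cost.

    `weight_matrices` and `lambd` are hypothetical names for illustration.
    Increasing lambd penalizes large weights more and reduces overfitting.
    """
    l2_term = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weight_matrices)
    return cross_entropy_cost + l2_term
```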

But then I look at the F1 scores and am confused by how close they are. This makes me suspect the correctness of my implementation above.

Thanks & Regards,
Chandan.

Your F1-score implementation for the binary problem looks fine. I'm not sure about your percentage_accuracy implementation, though, because eval_labels and eval_data in the code above are global variables.

Note that you can also simplify the calculations a little:

```python
np.sum(predicted_labels == 1)  # <- (true_positives + false_positives)
np.sum(true_labels == 1)       # <- (true_positives + false_negatives)
```

Hi @manifest, thanks a lot for the response. Actually, my method definition was a bit different, and while pasting it here on Discourse as a standalone function, I made some changes and missed a few. This is the original definition.

```python
from typing import Tuple

import numpy as np

def compute_evaluation_metric(self, predicted_labels: np.ndarray, true_labels: np.ndarray,
                              num_examples: int) -> Tuple[np.float_, np.float_]:
    np.testing.assert_equal(predicted_labels.shape, true_labels.shape)
    # Threshold the predicted probabilities into hard 0/1 labels (in place).
    predicted_labels[predicted_labels > 0.5] = 1
    predicted_labels[predicted_labels <= 0.5] = 0
    true_positives = np.sum((predicted_labels == 1) & (true_labels == 1))
    false_positives = np.sum((predicted_labels == 1) & (true_labels == 0))
    false_negatives = np.sum((predicted_labels == 0) & (true_labels == 1))
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * (precision * recall) / (precision + recall)
    return np.sum(predicted_labels == true_labels) / num_examples, f1
```
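For context, this is roughly how I call it; the names below are placeholders for illustration, not my actual code:

```python
# Hypothetical call site; `model`, `dev_predictions`, and `dev_labels`
# are placeholder names.
dev_accuracy, dev_f1 = model.compute_evaluation_metric(
    dev_predictions.copy(),  # pass a copy: the method thresholds the array in place
    dev_labels,
    dev_labels.shape[1],     # num_examples, assuming labels have shape (1, m)
)
print(f"Dev set prediction accuracy: {dev_accuracy}, F1 score: {dev_f1}")
```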

Also, thanks a lot for the simplification ideas. They make sense, and I will consider those changes in my code.

What I am also interested in is how to correctly interpret the evaluation metrics, which remain the same as before. When there is a larger gap between the train and dev set prediction accuracies, but not so much between the F1 scores, what does that mean?
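To make this concrete for myself, here is one hypothetical confusion matrix that reproduces my dev numbers exactly; the dev set size of 50 is an assumption for illustration, not my actual data:

```python
# Hypothetical dev set: 50 examples, 30 actual positives, 20 actual negatives.
tp, fp, fn, tn = 27, 9, 3, 11

accuracy = (tp + tn) / (tp + fp + fn + tn)          # 38 / 50 = 0.76
precision = tp / (tp + fp)                          # 27 / 36 = 0.75
recall = tp / (tp + fn)                             # 27 / 30 = 0.90
f1 = 2 * precision * recall / (precision + recall)  # 0.8181...

# F1 ignores true negatives: here only 11 of 20 negatives (0.55) are
# classified correctly, which drags accuracy down while F1 stays high.
```

So one reading would be that my dev errors are concentrated in the negative class, which the F1 score does not see.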

Of course, another possibility is that I have an error somewhere in my implementation. I should look into this and also do some manual error analysis to understand where the classification is failing.

Initially, I thought my evaluation-metric implementation was wrong. But you suggested it is fine, which means the error is probably somewhere else.

It seems that you may have an issue in your implementation of accuracy. You calculate the F1 score on the mini-batch but compare it with an accuracy calculated differently. I'm not sure what num_examples is there for.
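For what it's worth, here is a sketch of deriving the accuracy from the same arrays as the F1 score, so both metrics always refer to exactly the same examples (not your code, just the idea):

```python
import numpy as np

def accuracy_over(predicted_labels: np.ndarray, true_labels: np.ndarray) -> float:
    # Derived from the arrays themselves: no separate num_examples to get wrong.
    return float(np.mean(predicted_labels == true_labels))
```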

Thanks a lot for the suggestions. I will take a step back and check my implementation.