Practical advice on a single-value evaluation metric

Hi Fellow learners and Mentors,

As I make progress through the assignments, I am also building my own implementations based on what I have learned so far. In Course 3, Professor Andrew discussed evaluating an algorithm's performance on the dev set using a single-value metric.

Based on my understanding, I have implemented a function that computes the prediction accuracy and, in particular, the F1 score.

    import numpy as np

    def prediction_accuracy_metric(predicted_labels, true_labels, num_examples):
        # Both arrays must have the same shape, e.g. (1, m) row vectors
        np.testing.assert_equal(predicted_labels.shape, true_labels.shape)
        # Threshold predicted probabilities into hard 0/1 labels (modifies the array in place)
        predicted_labels[predicted_labels > 0.5] = 1
        predicted_labels[predicted_labels <= 0.5] = 0
        # Confusion-matrix counts for the positive class
        true_positives = np.sum((predicted_labels == 1) & (true_labels == 1))
        false_positives = np.sum((predicted_labels == 1) & (true_labels == 0))
        false_negatives = np.sum((predicted_labels == 0) & (true_labels == 1))
        precision = true_positives / (true_positives + false_positives)
        recall = true_positives / (true_positives + false_negatives)
        f1 = 2 * (precision * recall) / (precision + recall)
        percentage_accuracy = np.sum(predicted_labels == true_labels) / num_examples
        return percentage_accuracy, f1
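
For context, this is roughly how I call it on the dev set (a simplified sketch with made-up numbers; I use (1, m) row vectors for the labels):

    import numpy as np

    # Toy example with 4 dev examples; predictions are probabilities from the model
    predicted = np.array([[0.9, 0.3, 0.8, 0.6]])
    truth = np.array([[1, 0, 0, 1]])

    accuracy, f1 = prediction_accuracy_metric(predicted, truth, num_examples=truth.shape[1])
    print(accuracy, f1)  # roughly 0.75 and 0.8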

Could someone please advise me whether this implementation looks appropriate?

Also, the following is an example of the output from one run of my implementation after 2500 iterations of optimization.

Training set prediction accuracy: 0.9234449760765551, F1 score: 0.8840579710144927

Dev set prediction accuracy: 0.76, F1 score: 0.8181818181818182

When I compare the training set and the dev set in terms of prediction accuracy alone, I get the impression that I have a variance problem: I have overfitted the training set, and I should try higher values of the regularization parameter or try dropout (which would be a less recommended approach compared to L2 regularization).
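
For reference, this is roughly how I am adding the L2 penalty in my own implementation (a minimal sketch with my own assumed names; `parameters`, `lambd`, and the layer numbering are my conventions, not from the course notebooks):

    import numpy as np

    def l2_cost_term(parameters, lambd, m):
        # parameters is assumed to be a dict like {"W1": ..., "b1": ..., "W2": ..., ...}
        num_layers = len(parameters) // 2
        # Sum of squared weights over all layers, scaled by lambda / (2 * m)
        weight_sum = sum(np.sum(np.square(parameters["W" + str(l)]))
                         for l in range(1, num_layers + 1))
        return (lambd / (2 * m)) * weight_sum  # added to the cross-entropy cost

    # The matching change in back-propagation is to add (lambd / m) * Wl to each dWl, e.g.
    # dWl = (1 / m) * np.dot(dZl, A_prev.T) + (lambd / m) * Wl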

But then I look at the F1 scores and get confused by how close they are, which makes me doubt the correctness of my implementation above.

Could someone please provide me with some advice regarding this?

Thanks & Regards,
Chandan.

Hey @chandan1986.sarkar,

Your F1-score implementation for a binary problem looks fine. I’m not sure about your percentage_accuracy implementation, though, because eval_labels and eval_data in the code above are global variables.

Note that you can also simplify the calculations a little:

    np.sum(predicted_labels == 1)  # <- (true_positives + false_positives)
    np.sum(true_labels == 1)       # <- (true_positives + false_negatives)
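
Applied to your function, that would look roughly like this (just a sketch, keeping your variable names, and assuming the labels have already been thresholded to exactly 0 and 1):

    precision = true_positives / np.sum(predicted_labels == 1)
    recall = true_positives / np.sum(true_labels == 1)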

Hi @manifest, thanks a lot for the response. Actually, my method definition was a bit different, and while pasting it here in Discourse as a standalone function, I made some changes and missed a few. This is the original definition.

    def compute_evaluation_metric(self, predicted_labels: np.ndarray, true_labels: np.ndarray, num_examples: int) -> Tuple[np.float_, np.float_]:
        np.testing.assert_equal(predicted_labels.shape, true_labels.shape)
        predicted_labels[predicted_labels > 0.5] = 1
        predicted_labels[predicted_labels <= 0.5] = 0
        true_positives = np.sum((predicted_labels == 1) & (true_labels == 1))
        false_positives = np.sum((predicted_labels == 1) & (true_labels == 0))
        false_negatives = np.sum((predicted_labels == 0) & (true_labels == 1))
        precision = true_positives / (true_positives + false_positives)
        recall = true_positives / (true_positives + false_negatives)
        f1 = 2 * (precision * recall) / (precision + recall)
        return np.sum(predicted_labels == true_labels) / num_examples, f1

Thanks also for the simplification ideas. They make sense, and I will consider those changes in my code.

What I am also interested in is how to correctly interpret the evaluation metrics, which still remain the same as above. When there is a larger gap in prediction accuracy between the train and dev sets, but not so much in the F1 score, what does that mean?
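
To make my confusion concrete, here is a toy confusion matrix (purely hypothetical counts, not from my actual runs) where accuracy lands near my dev-set value while F1 comes out higher:

    # Hypothetical dev set of 100 examples: 60 positives, 40 negatives
    true_positives, false_positives, false_negatives, true_negatives = 45, 9, 15, 31

    precision = true_positives / (true_positives + false_positives)  # = 45 / 54 = 0.833...
    recall = true_positives / (true_positives + false_negatives)     # = 45 / 60 = 0.75
    f1 = 2 * (precision * recall) / (precision + recall)             # about 0.79
    accuracy = (true_positives + true_negatives) / 100               # = 0.76

As far as I understand, F1 ignores true negatives, so the two metrics do not have to move together, but I may be misreading this.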

One possibility is that I have an error somewhere in my implementation. I should probably look into this and also do some manual error analysis to understand where the classification is failing.

Initially, I thought that my evaluation-metric implementation was wrong. But since you suggest it is okay, I perhaps have an error someplace else.

It seems that you may have an issue in your implementation of accuracy: you calculate the F1-score on the mini-batch, but compare it with an accuracy that is calculated differently. I’m not sure what num_examples is there for.
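
For example (just a sketch, keeping your names), you could drop the extra argument and derive the denominator from the labels themselves, so that accuracy and F1 are always computed over the same set of examples:

    # true_labels and predicted_labels have the same shape (asserted above),
    # so the number of examples can come straight from the array
    percentage_accuracy = np.sum(predicted_labels == true_labels) / true_labels.size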

Thanks a lot for the suggestions. I will take a step back and check my implementation.