Practical advice on a single-value evaluation metric

Hi Fellow learners and Mentors,

As I make progress through the assignments, I am also building my own implementations based on what I have learned so far. In Course 3, Professor Andrew discussed evaluating an algorithm's performance on the dev set using a single-value metric.

Based on my understanding, I have implemented a function that computes the prediction accuracy and, in particular, the F1 score.

    import numpy as np

    def prediction_accuracy_metric(predicted_labels, true_labels, num_examples):
        # Both arrays must have the same shape, e.g. (1, m) row vectors
        np.testing.assert_equal(predicted_labels.shape, true_labels.shape)
        # Threshold predicted probabilities into hard 0/1 labels (modifies the array in place)
        predicted_labels[predicted_labels > 0.5] = 1
        predicted_labels[predicted_labels <= 0.5] = 0
        # Confusion-matrix counts for the positive class
        true_positives = np.sum((predicted_labels == 1) & (true_labels == 1))
        false_positives = np.sum((predicted_labels == 1) & (true_labels == 0))
        false_negatives = np.sum((predicted_labels == 0) & (true_labels == 1))
        precision = true_positives / (true_positives + false_positives)
        recall = true_positives / (true_positives + false_negatives)
        f1 = 2 * (precision * recall) / (precision + recall)
        percentage_accuracy = np.sum(predicted_labels == true_labels) / num_examples
        return percentage_accuracy, f1
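
For context, this is roughly how I call it on the dev set (a simplified sketch with made-up numbers; I use (1, m) row vectors for the labels):

    import numpy as np

    # Toy example with 4 dev examples; predictions are probabilities from the model
    predicted = np.array([[0.9, 0.3, 0.8, 0.6]])
    truth = np.array([[1, 0, 0, 1]])

    accuracy, f1 = prediction_accuracy_metric(predicted, truth, num_examples=truth.shape[1])
    print(accuracy, f1)  # roughly 0.75 and 0.8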

Could someone please advise me whether this implementation looks appropriate?

Also, the following is an example of the output from one run of my implementation after 2500 iterations of optimization.

Training set prediction accuracy: 0.9234449760765551, F1 score: 0.8840579710144927

Dev set prediction accuracy: 0.76, F1 score: 0.8181818181818182

When I compare the training set and the dev set in terms of prediction accuracy alone, I get the impression that I have a variance problem: I have overfitted the training set, and I should try higher values of the regularization parameter or try dropout (which would be a less recommended approach compared to L2 regularization).
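
For reference, this is roughly how I am adding the L2 penalty in my own implementation (a minimal sketch with my own assumed names; `parameters`, `lambd`, and the layer numbering are my conventions, not from the course notebooks):

    import numpy as np

    def l2_cost_term(parameters, lambd, m):
        # parameters is assumed to be a dict like {"W1": ..., "b1": ..., "W2": ..., ...}
        num_layers = len(parameters) // 2
        # Sum of squared weights over all layers, scaled by lambda / (2 * m)
        weight_sum = sum(np.sum(np.square(parameters["W" + str(l)]))
                         for l in range(1, num_layers + 1))
        return (lambd / (2 * m)) * weight_sum  # added to the cross-entropy cost

    # The matching change in back-propagation is to add (lambd / m) * Wl to each dWl, e.g.
    # dWl = (1 / m) * np.dot(dZl, A_prev.T) + (lambd / m) * Wl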

But then I look at the F1 scores and get confused by how close they are, which makes me doubt the correctness of my implementation above.

Could someone please provide me with some advice regarding this?

Thanks & Regards,
Chandan.

Hey @chandan1986.sarkar,

Your F1-score implementation for a binary problem looks fine. I’m not sure about your percentage_accuracy implementation, though, because eval_labels and eval_data in the code above are global variables.

Note that you can also simplify the calculations a little:

    np.sum(predicted_labels == 1)  # <- (true_positives + false_positives)
    np.sum(true_labels == 1)       # <- (true_positives + false_negatives)
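
Applied to your function, that would look roughly like this (just a sketch, keeping your variable names, and assuming the labels have already been thresholded to exactly 0 and 1):

    precision = true_positives / np.sum(predicted_labels == 1)
    recall = true_positives / np.sum(true_labels == 1)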

Hi @manifest, thanks a lot for the response. Actually, my method definition was a bit different, and while pasting it here in Discourse as a standalone function, I made some changes and missed a few. This is the original definition.

    def compute_evaluation_metric(self, predicted_labels: np.ndarray, true_labels: np.ndarray, num_examples: int) -> Tuple[np.float_, np.float_]:
        np.testing.assert_equal(predicted_labels.shape, true_labels.shape)
        predicted_labels[predicted_labels > 0.5] = 1
        predicted_labels[predicted_labels <= 0.5] = 0
        true_positives = np.sum((predicted_labels == 1) & (true_labels == 1))
        false_positives = np.sum((predicted_labels == 1) & (true_labels == 0))
        false_negatives = np.sum((predicted_labels == 0) & (true_labels == 1))
        precision = true_positives / (true_positives + false_positives)
        recall = true_positives / (true_positives + false_negatives)
        f1 = 2 * (precision * recall) / (precision + recall)
        return np.sum(predicted_labels == true_labels) / num_examples, f1

Thanks also for the simplification ideas. They make sense, and I will consider those changes in my code.

What I am also interested in is how to correctly interpret the evaluation metrics, which still remain the same as above. When there is a larger gap in prediction accuracy between the train and dev sets, but not so much in the F1 score, what does that mean?
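
To make my confusion concrete, here is a toy confusion matrix (purely hypothetical counts, not from my actual runs) where accuracy lands near my dev-set value while F1 comes out higher:

    # Hypothetical dev set of 100 examples: 60 positives, 40 negatives
    true_positives, false_positives, false_negatives, true_negatives = 45, 9, 15, 31

    precision = true_positives / (true_positives + false_positives)  # = 45 / 54 = 0.833...
    recall = true_positives / (true_positives + false_negatives)     # = 45 / 60 = 0.75
    f1 = 2 * (precision * recall) / (precision + recall)             # about 0.79
    accuracy = (true_positives + true_negatives) / 100               # = 0.76

As far as I understand, F1 ignores true negatives, so the two metrics do not have to move together, but I may be misreading this.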

One possibility is that I have an error somewhere in my implementation. I should probably look into this and also do some manual error analysis to understand where the classification is failing.

Initially, I thought that my evaluation-metric implementation was wrong. But since you suggest it is okay, I perhaps have an error someplace else.

It seems that you may have an issue in your implementation of accuracy: you calculate the F1-score on the mini-batch, but compare it with an accuracy that is calculated differently. I’m not sure what num_examples is there for.
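
For example (just a sketch, keeping your names), you could drop the extra argument and derive the denominator from the labels themselves, so that accuracy and F1 are always computed over the same set of examples:

    # true_labels and predicted_labels have the same shape (asserted above),
    # so the number of examples can come straight from the array
    percentage_accuracy = np.sum(predicted_labels == true_labels) / true_labels.size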

Thanks a lot for the suggestions. I will take a step back and check my implementation.