W3A2 Trigger_word_detection_v2a Dev set other metric testing

(Self-tinkering on the exercise)

In 2.3 - Test the Model:
The text explains that accuracy is a poor metric in this case because the labels are skewed.
It suggests that one should use F1 or precision/recall (which I take to mean something like PrecisionAtRecall), but then says we won’t bother with it, which triggered me to do it anyway x)

The built-in Keras function
https://www.tensorflow.org/api_docs/python/tf/keras/metrics/F1Score
does not seem to be available. (I guess a newer version of Keras would need to be installed.)
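For reference, on a newer TensorFlow/Keras (TF 2.13+) where that metric exists, constructing it would look roughly like this; I could not test it in the notebook's environment:

```python
import tensorflow as tf

# Only available in newer TF/Keras versions; threshold turns the
# sigmoid outputs into 0/1 predictions before F1 is computed.
f1 = tf.keras.metrics.F1Score(threshold=0.5)
```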
I found some code to build the metric myself (a sketch of it follows below).
I am unfortunately not knowledgeable enough yet to know whether it is a correct implementation, but it is an accepted answer with some upvotes. I also used PrecisionAtRecall, since it is available directly from Keras in this version, as another reference point to compare the unvalidated F1 code against.
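The snippet was along these lines (a sketch of the batch-wise F1 implementation that commonly circulates on Stack Overflow; not necessarily verbatim what I ran):

```python
from tensorflow.keras import backend as K

def f1_metric(y_true, y_pred):
    """Batch-wise F1: thresholds the sigmoid predictions at 0.5.

    Caveat: Keras computes this per batch and averages the results,
    so it only approximates the true F1 over the whole set.
    """
    y_pred = K.round(K.clip(y_pred, 0, 1))           # 0/1 predictions
    tp = K.sum(y_true * y_pred)                      # true positives
    precision = tp / (K.sum(y_pred) + K.epsilon())   # epsilon avoids /0
    recall = tp / (K.sum(y_true) + K.epsilon())
    return 2 * precision * recall / (precision + recall + K.epsilon())
```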

I tried both this and PrecisionAtRecall on the model.
During the additional training epochs I get:
F1Score of 0.62 - 0.65
PrecisionAtRecall(recall=0.8) of 0.48 - 0.6
PrecisionAtRecall(recall=0.5) of 0.73 - 0.8

However, on the dev set I only get the following:
F1Score of 0.23
PrecisionAtRecall(recall=0.8) of 0.20
PrecisionAtRecall(recall=0.5) of 0.30
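For context, this is roughly how I wired the metrics in; `model`, `X`, `Y`, `X_dev`, and `Y_dev` are the notebook's variables, and `f1_metric` is the custom function sketched above:

```python
from tensorflow.keras.metrics import PrecisionAtRecall

# Recompile the notebook's model with the extra metrics attached.
model.compile(
    loss="binary_crossentropy",
    optimizer="adam",
    metrics=[f1_metric,
             PrecisionAtRecall(recall=0.8),
             PrecisionAtRecall(recall=0.5)],
)

model.fit(X, Y, batch_size=5, epochs=1)  # the additional training epochs
model.evaluate(X_dev, Y_dev)             # same metrics on the dev set
```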

The scores seem quite low, so did I mess up somewhere?

If not, is it correct to interpret these results as:

  1. The model is overfitting during the extra training epochs (maybe due to the small dataset, or because the synthesized extra data we created is already part of the larger dataset used to pre-train the model)?
  2. The accuracy is indeed very misleading, because according to the other metrics the model is not doing that well on the dev set?

The empirical results we look at in the notebook, however, came out fairly well.
The model also did quite well on my own test, although I had little background noise.

These two points would indicate the model is doing quite well, which leads me to believe I may not be understanding the metrics correctly?

Please help me understand what this means:

Keras 3 has Precision and Recall.
It’s hard to repeat all the details from the lectures on model debugging. Do brush up on optimizing and satisficing metrics to decide how you want to measure model performance.
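E.g., something along these lines (a sketch, version permitting):

```python
import keras  # Keras 3

metrics = [
    keras.metrics.Precision(name="precision"),
    keras.metrics.Recall(name="recall"),
]
```

F1 can then be derived from the two as 2·P·R / (P + R).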

F1 score is a good start for skewed labels. Assuming there are no mistakes in the implementation, I’d say the model is underfitting the dataset due to the low F1 score. Here are two pointers based on measures like accuracy (assuming similar data distributions across splits; a small numeric sketch follows the list):

  1. If the Bayes measure (i.e. the ideal target) is high compared to the train set measure, we have underfitting.
  2. If the train set measure is high compared to the dev set measure, we have overfitting.
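A small numeric sketch of those two pointers (made-up numbers, purely for illustration):

```python
# Made-up numbers purely to illustrate the two pointers above.
bayes_measure = 0.99  # ideal target, e.g. human-level performance
train_measure = 0.93
dev_measure = 0.91

avoidable_bias = bayes_measure - train_measure  # 0.06 -> bias signal
variance = train_measure - dev_measure          # 0.02 -> variance signal

if avoidable_bias > variance:
    print("Gap to Bayes dominates: underfitting, focus on reducing bias.")
else:
    print("Train/dev gap dominates: overfitting, focus on reducing variance.")
```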