Transformer Network Application: Question Answering Lab

I have a question about the TensorFlow and PyTorch code comparison. In TF2, we had two loss functions and trained the model to minimize the average of the two. Why was the approach in PyTorch different? There we used a metric (F1 score), and instead of trying to minimize a loss it looked like we were trying to maximize the metric.

Sorry, I haven’t looked at the labs yet.

But in general, the F1 score is used as a metric when the data set is highly skewed (there are lots of “False” examples and very few “True” examples).

If you rely only on the cost value with a skewed data set, the system has little incentive to learn to predict the True cases, because it can reach a very low cost by predicting False for everything.
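To make that concrete, here is a small sketch using binary cross-entropy as the cost. The data set (5 True, 95 False) and the predicted probabilities are made-up numbers for illustration, not anything from the lab.

```python
import math

def binary_cross_entropy(y_true, y_prob):
    """Mean binary cross-entropy over a list of labels and predicted P(True)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical skewed data set: 5 True examples, 95 False examples.
y_true = [1] * 5 + [0] * 95

# A "model" that ignores the input and always says "almost certainly False".
always_false_prob = [0.05] * 100

# A "model" that is maximally unsure about every example.
coin_flip_prob = [0.5] * 100

cost_always_false = binary_cross_entropy(y_true, always_false_prob)
cost_coin_flip = binary_cross_entropy(y_true, coin_flip_prob)

print("cost, always False:", round(cost_always_false, 4))
print("cost, coin flip:   ", round(cost_coin_flip, 4))
```

The always-False predictor ends up with a much lower cost than the coin flip even though it never identifies a single True example.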

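The F1 score does a better job of balancing the predictions for both False and True. It combines precision and recall on the True class, so a system that never predicts True gets a poor score.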
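As a rough illustration, the sketch below computes accuracy and F1 by hand for two sets of predictions on a skewed data set. The labels and predictions are invented for the example, and the helper names are my own rather than anything from the lab.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples where the prediction matches the label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def f1_score(y_true, y_pred):
    """F1 for the True (1) class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: treat F1 as 0 (a common convention).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical skewed data set: 5 True examples, then 95 False examples.
y_true = [1] * 5 + [0] * 95

# Predictor A: always says False.
always_false = [0] * 100

# Predictor B: finds 4 of the 5 True cases and raises 2 false alarms.
finds_some = [1, 1, 1, 1, 0] + [1, 1] + [0] * 93

for name, y_pred in [("always False", always_false), ("finds some", finds_some)]:
    print(name, "accuracy:", accuracy(y_true, y_pred), "F1:", round(f1_score(y_true, y_pred), 4))
```

The always-False predictor gets 95% accuracy but an F1 of 0, while the predictor that actually finds most of the True cases is only slightly ahead on accuracy and far ahead on F1.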