I would suggest a residual analysis, using the metric that best describes your business problem, and working with:
a training set,
(a validation set) - in brackets since you seem to have only ~25 labels, if I interpret your last plot correctly and assuming it is complete with respect to your labels,
and a test set (that was never seen by the model before).
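
To make the split and the residual check concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the toy data, the helper names (`split_indices`, `fit_line`, `residuals`), the straight-line placeholder model, and mean absolute error as the metric. Swap in your own model and whichever metric fits your business problem.

```python
import random
from statistics import mean

def split_indices(n, val_frac=0.0, test_frac=0.2, seed=0):
    """Shuffle 0..n-1 and cut into train / validation / test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(n * test_frac))
    n_val = round(n * val_frac)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (placeholder model only)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def residuals(model, xs, ys):
    """Residual = observed value minus predicted value."""
    a, b = model
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Made-up toy data with 25 labelled points (not from the question).
x = [float(i) for i in range(25)]
y = [2.0 * xi + 1.0 + (0.5 if i % 2 else -0.5) for i, xi in enumerate(x)]

# With so few labels the validation set is skipped here (val_frac=0.0).
train, val, test = split_indices(len(x), val_frac=0.0, test_frac=0.2, seed=0)

# Fit on the training set only; the test set is never seen during fitting.
model = fit_line([x[i] for i in train], [y[i] for i in train])

# Residuals and the chosen metric are computed on the held-out test set.
test_res = residuals(model, [x[i] for i in test], [y[i] for i in test])
mae = mean(abs(r) for r in test_res)
print("test residuals:", test_res)
print("test MAE:", mae)
```

Looking at the test residuals themselves (their sign, size, and any pattern against the inputs), and not only at the single summary number, is the point of the residual analysis.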
I believe you will find a thread here that describes how to do that: