When watching the video about the triplet loss for face recognition, Andrew explained that we take the max of the loss and 0 because we don't care how much the distances differ, as long as the difference is bigger than the margin. I was wondering: wouldn't it be beneficial for training to let the loss become negative in case the model performs very well? To me it looks like we are rather "incentivizing" the model to only perform as well as needed here.
Optimizing the model to minimize the triplet loss ensures that the distance between the anchor and negative representations is at least a margin "alpha" higher than the distance between the anchor and positive representations. This lets us learn an embedding space where the anchor and positive representations are close, while the anchor and negative representations are farther apart.
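For concreteness, here is a minimal sketch of that loss (my own illustration, not code from the course), using squared Euclidean distances and plain NumPy:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Triplet loss for one (anchor, positive, negative) embedding triple."""
    d_ap = np.sum((anchor - positive) ** 2)  # anchor-positive distance
    d_an = np.sum((anchor - negative) ** 2)  # anchor-negative distance
    # Clamped at 0: once d_an >= d_ap + alpha, the triplet contributes nothing
    return max(d_ap - d_an + alpha, 0.0)
```

That `max(..., 0.0)` is exactly the clamp the original question is about.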
Also, if the loss were allowed to go negative, the gradient would keep pushing the anchor-negative distance to grow without bound, which can lead to exploding gradients. Clamping the loss at 0 stops the updates once the margin is satisfied, keeping them within sensible limits.
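You can check that "the gradient stops once the margin is satisfied" directly. A small sketch (my own, with made-up embeddings) comparing the clamped and unclamped versions with PyTorch autograd:

```python
import torch

def triplet_loss(a, p, n, alpha=0.2, clamp=True):
    d_ap = ((a - p) ** 2).sum()
    d_an = ((a - n) ** 2).sum()
    loss = d_ap - d_an + alpha
    return torch.clamp(loss, min=0.0) if clamp else loss

a = torch.zeros(4, requires_grad=True)
p = torch.zeros(4) + 0.1       # positive very close to the anchor
n = torch.ones(4) * 10.0       # negative already far beyond the margin

# Clamped: the triplet is satisfied, so the gradient is exactly zero
triplet_loss(a, p, n, clamp=True).backward()
print(a.grad)                  # all zeros

a.grad = None
# Unclamped: the gradient still pushes the negative even farther away
triplet_loss(a, p, n, clamp=False).backward()
print(a.grad)                  # nonzero, would keep growing d_an forever
```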
Intuitively speaking, I think the job of a classification task is to define a classification boundary that separates two classes. In other words, as long as the P and N samples are separated by the margin, then, for this good pair, the boundary is clear and the job is done! However, if we further push the good pairs apart, would we then create more bad pairs, or would we fail to reduce the number of bad pairs?
Yes - we are incentivizing the model to only perform sufficiently well, but we are also asking it to focus on the bad pairs (aka semi-hard and hard triplets) by ignoring the good ones (aka easy triplets).
If you have read about "imbalanced datasets", you know that we might want to, for example, upsample the minority class to keep the network from biasing towards the majority one. Here, with the triplet loss, we are taking out those good pairs, so, effectively, we are "upsampling" the bad pairs so that the training can focus on them.
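A rough sketch of that filtering effect (the names `d_ap`/`d_an` are just illustrative): dropping the zero-loss triplets from a batch leaves only the semi-hard and hard ones, which is the "upsampling" described above:

```python
import numpy as np

def keep_hard_and_semi_hard(d_ap, d_an, alpha=0.2):
    """Mask of triplets that still incur loss, i.e. d_an < d_ap + alpha."""
    return d_an < d_ap + alpha

# Example: anchor-positive and anchor-negative distances for 5 triplets
d_ap = np.array([0.2, 0.5, 0.1, 0.9, 0.3])
d_an = np.array([1.5, 0.6, 0.1, 0.4, 2.0])
mask = keep_hard_and_semi_hard(d_ap, d_an, alpha=0.2)
print(mask)  # [False  True  True  True False] -> easy triplets 1 and 5 are ignored
```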