Link to the lecture video in question: https://www.coursera.org/learn/custom-models-layers-loss-functions-with-tensorflow/lecture/iGjXg/contrastive-loss
The explanation given in the video for the contrastive loss function does not illustrate the motivation behind why the function is set up this way. The video states that “for similar images, we’re going to have a high value” because the “D squared” term dominates, and that the function should evaluate to “a much smaller value than D squared” when the images are dissimilar and the max term dominates.
Rather than stating which term evaluates to the greater value at a given D, the explanation should delve into how the loss changes as D increases in the similar vs. dissimilar cases: when the images are similar (Y = 1), the loss increases as D increases; when the images are dissimilar (Y = 0), the loss decreases as D increases, bottoming out at zero once D reaches the margin.
This is because the desired mapping of images to feature vectors after training should result in small D values for similar images and large D values for dissimilar images, so the loss should penalize the embedding in each case according to how far D is from that goal.
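To make this concrete, here is a minimal sketch of the loss as given in the lecture, L = Y * D^2 + (1 - Y) * max(margin - D, 0)^2, evaluated at a few values of D (the margin of 1.0 is chosen purely for illustration):

```python
import tensorflow as tf

def contrastive_loss(y, d, margin=1.0):
    # L = Y * D^2 + (1 - Y) * max(margin - D, 0)^2
    # Y = 1 for a similar pair, Y = 0 for a dissimilar pair.
    return y * tf.square(d) + (1.0 - y) * tf.square(tf.maximum(margin - d, 0.0))

d = tf.constant([0.0, 0.25, 0.5, 0.75, 1.0])
print(contrastive_loss(1.0, d).numpy())  # similar:    [0.     0.0625 0.25   0.5625 1.    ]
print(contrastive_loss(0.0, d).numpy())  # dissimilar: [1.     0.5625 0.25   0.0625 0.    ]
```

The similar-case loss grows monotonically with D, while the dissimilar-case loss shrinks to zero as D approaches the margin, which is exactly the behavior the explanation should emphasize.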
Text from the transcript of the relevant portion of the video (starts at 1:24):
"… D is the tensor of Euclidean distances between the pairs of images.
Margin is a constant that we can use to enforce a minimum distance between them in order to consider them similar or different.
Let’s consider what happens when Y is one, and I replace the Y’s with one, so then this equation will be reduced down to D squared, so that we can see for similar images, we’re going to have a high value.
When Y is zero, and we sub this in for Y, then our value instead of D squared will be the max between the margin minus D or zero, which is then squared, and this should be a much smaller value than D squared.
You can think of the Y and one minus Y in this loss function as weights that will either allow the D squared or the max part of the formula to dominate the overall loss.
When Y is close to one, it gives more weight to the D squared term and less weight on the max term. The D squared term will dominate the calculation of the loss.
Conversely, when Y is closer to zero, this gives much more weight to the max term and less weight to the D squared term, so the max term dominates the calculation of the loss…"
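As a quick check of the “much smaller value than D squared” claim: with margin = 1 and D = 0.1, the max term is max(1 - 0.1, 0)^2 = 0.81, which is far larger than D^2 = 0.01. Which term is larger depends entirely on D, which is another reason the explanation should focus on how each term varies with D rather than on their relative sizes.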