C1W2 - Lecture Video - Contrastive Loss - Explanation Not Clear

Link to the lecture video in question: https://www.coursera.org/learn/custom-models-layers-loss-functions-with-tensorflow/lecture/iGjXg/contrastive-loss

The explanation given in the video for the contrastive loss function does not illustrate the motivation behind why the function is set up this way. The video states that “for similar images, we’re going to have a high value” because the “D squared” term dominates, and that the function should evaluate to “a much smaller value than D squared” when the images are dissimilar and the max term dominates.

Rather than trying to state which term evaluates to a greater value for a given value of D, the explanation should delve into how the loss function changes as D increases for the similar vs. dissimilar cases. The loss increases as D increases when the images are similar, and it decreases as D increases (for D < margin) when the images are dissimilar.

The reason for this is that the desired mapping of images to feature vectors after training should result in small D values for similar images and large D values for dissimilar images.
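
To make that behavior concrete, here is a minimal sketch (plain NumPy, and assuming the usual form of the loss, Y * D^2 + (1 - Y) * max(margin - D, 0)^2 with margin = 1; the function name is just illustrative) that prints the loss at a few distances for both cases:

```python
# A minimal sketch of the per-pair contrastive loss, assuming the form
# L = Y * D^2 + (1 - Y) * max(margin - D, 0)^2 with margin = 1.
import numpy as np

def contrastive_loss(y, d, margin=1.0):
    """y = 1 for a similar pair, 0 for a dissimilar pair; d = Euclidean distance."""
    return y * d**2 + (1 - y) * np.maximum(margin - d, 0.0)**2

for d in [0.0, 0.25, 0.5, 0.75, 1.0, 1.5]:
    print(f"D={d:4.2f}   similar (Y=1): {contrastive_loss(1, d):.3f}   "
          f"dissimilar (Y=0): {contrastive_loss(0, d):.3f}")
```

The similar-case column grows as D grows, while the dissimilar-case column shrinks and stays at zero once D reaches the margin, which is exactly the behavior described above.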

Text from transcript of relevant portion of the video (starts at 1:24):

"… D is the tensor of Euclidean distances between the pairs of images.

Margin is a constant that we can use to enforce a minimum distance between them in order to consider them similar or different.

Let’s consider what happens when Y is one, and I replace the Y’s with one, so then this equation will be reduced down to D squared, so that we can see for similar images, we’re going to have a high value.

When Y is zero, and we sub this in for Y, then our value instead of D squared will be the max between the margin minus D or zero, which is then squared, and this should be a much smaller value than D squared.

You can think of the Y and one minus Y in this loss function as weights that will either allow the D squared of the max part of the formula to dominate the overall loss.

When Y is close to one, it gives more weight to the D squared term and less weight on the max term. The D squared term will dominate the calculation of the loss.

Conversely, when Y is closer to zero, this gives much more weight to the max term and less weight to the D squared term, so the max term dominates the calculation of the loss…"

Hi @classical_leap,
I can see your point that that part of the explanation could be confusing. To me, the important aspects of the contrastive loss formula are that:
1). One part of the formula (the D squared part) covers the case where Y=1, and the other part (the max(margin-D,0) squared part) covers Y=0, where Y=1 means that we expect the two images to be similar, and Y=0 means we expect the images to be different (the full formula is written out right after this list)
2). Our loss should be large if the distance, D, between the two images is large when we expected the images to be similar, and the loss should also be large if we expected the images to be dissimilar, but the distance, D, between the two was small.
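
For reference, those two pieces combine into a single expression. This is just the loss described above written out in symbols, with Y as the label and D as the Euclidean distance for one pair:

```latex
L(Y, D) = Y \cdot D^{2} + (1 - Y) \cdot \max(\mathrm{margin} - D,\ 0)^{2}
```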

There’s a little more explanation of this in this old post: C1W2 -> understand the constrastive loss function


Maybe “explanation not clear” was a squishy way to say what I mean. What I meant is that the explanation given in the video does not at all convey to the viewer what is essential about the contrastive loss equation and why that is important.

Similar to what you stated, the most important part of the contrastive loss equation, to me, is the way the loss value changes with D depending on the value of Y. This matters because the direction of the gradients calculated during backpropagation is such that the training process should drive the model to output small D values for similar inputs and large D values for dissimilar inputs.
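
As a rough check of that, here is a hedged sketch using tf.GradientTape (again assuming the Y * D^2 + (1 - Y) * max(margin - D, 0)^2 form with margin = 1; the names are just illustrative), which treats D as a variable and looks at the sign of dL/dD in each case:

```python
# A rough check of the gradient direction with tf.GradientTape, assuming
# L = Y * D^2 + (1 - Y) * max(margin - D, 0)^2 with margin = 1.
import tensorflow as tf

def loss(y, d, margin=1.0):
    return y * tf.square(d) + (1.0 - y) * tf.square(tf.maximum(margin - d, 0.0))

d = tf.Variable(0.5)  # a distance inside the margin

with tf.GradientTape() as tape:
    similar_loss = loss(1.0, d)
grad_similar = tape.gradient(similar_loss, d)        # positive: a descent step reduces D

with tf.GradientTape() as tape:
    dissimilar_loss = loss(0.0, d)
grad_dissimilar = tape.gradient(dissimilar_loss, d)  # negative: a descent step increases D

print(grad_similar.numpy(), grad_dissimilar.numpy())  # 1.0 and -1.0 at D = 0.5
```

The gradient is positive in the similar case (so a gradient-descent step reduces D) and negative in the dissimilar case while D is inside the margin (so a descent step increases D).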

That’s a nice, concise summary. In general, we want to minimize the loss as we train. In this case, that means we want to work towards reducing D when the two images are similar, and increasing D when the images are dissimilar, which this contrastive loss function will do.

I agree with you; I don’t really understand why the video ends up focusing on the relative size of the losses for the similar case vs. the dissimilar case. He gets us most of the way there, explaining each part of the formula and pointing out that one part is for the similar case and one part is for the dissimilar case. But then, whether he lost track or just hadn’t thought it completely through, he seems to reach a point of “What else do I need to say?”, and what pops into his head is to talk about how the loss for the similar case is a bigger factor than the loss for the dissimilar case.

I can understand why his mind might have taken him there, because it is an interesting subtlety of the formula. It’s true that in the similar case, if D is large, we’ll have a large loss, which is what we want, because we want to train until we get down to a small D for similar images. In the dissimilar case, though, as long as D is larger than the margin we don’t really care how much larger it is, so the dissimilar case only contributes to the loss when D is smaller than the margin, which will, of course, tend to be a fairly small number.

In any case, thank you for bringing this up to help clarify the main points of this formula. It will be helpful for other students who might find the video similarly confusing.