I am confused after watching the video on using the Triplet Loss function for training the Siamese network. The loss function is computed after each forward pass through the network. Since the input to the network consists of a single image, how can the loss be a function of three images?

Triplet loss is formulated over three data points, which here are three images. First, you have the anchor, which you can think of as the usual training sample. Then you have two additional samples: a positive, which you know belongs to the same class as the anchor, and a negative, which you know belongs to a different class. Both can be chosen from the same training set.

You want to train the model so that, in the embedding space, the distance between the positive and the anchor is shorter than the distance between the negative and the anchor by at least some margin.
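To make that objective concrete, here is a minimal sketch of the loss for a single triplet, using only the Python standard library. It assumes squared Euclidean distance and a margin of 0.2; neither value comes from this thread, and the function names are made up for illustration.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss for one (anchor, positive, negative) triplet.

    The margin value is an assumption chosen for illustration.
    """
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    # The loss is zero once the negative is farther away than the
    # positive by at least `margin`.
    return max(d_pos - d_neg + margin, 0.0)


# Embeddings (not raw images): the negative is far away, so no penalty.
easy = triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 0.0])

# The negative is almost as close as the positive, so the loss is positive.
hard = triplet_loss([0.0, 0.0], [0.1, 0.0], [0.2, 0.0])
```

Note that the inputs are embedding vectors, so the loss needs the network's output for all three images.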

This loss function is only applicable when you have multiple data points for each class: there must always be at least one positive sample that matches the anchor.
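As a rough illustration of that constraint, the sketch below picks one triplet of indices at random from a labeled training set. The helper `sample_triplet` and the example labels are hypothetical, and uniform random sampling is just the simplest choice, not a recommended mining strategy.

```python
import random
from collections import defaultdict


def sample_triplet(labels, rng=random):
    """Return (anchor, positive, negative) indices into `labels`."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # The anchor's class needs at least two samples so that a positive
    # distinct from the anchor exists.
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= 2]
    if not eligible or len(by_class) < 2:
        raise ValueError("need a class with >= 2 samples and at least 2 classes")

    anchor_class = rng.choice(eligible)
    anchor, positive = rng.sample(by_class[anchor_class], 2)

    # The negative comes from any other class.
    negative_class = rng.choice([c for c in by_class if c != anchor_class])
    negative = rng.choice(by_class[negative_class])
    return anchor, positive, negative


labels = ["alice", "alice", "bob", "carol", "bob"]
a, p, n = sample_triplet(labels, random.Random(0))
```

A class with a single image (like "carol" above) can still serve as a negative, but never as the anchor.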

Thank you, but you are not answering my question. I understand how the triplet loss is formulated. What I don’t understand is: since only the anchor image is fed to the input of the network, how come you have the feature vectors for your positive and your negative images available on the output for computing the loss function?

It looks like I have to answer my own question. Andrew's explanation seems to be missing a very important detail, which is that the network architecture consists of three copies of the same network (sharing one set of weights) and a concatenation layer. The details can be seen here: Create a Siamese Network with Triplet Loss in Keras | RUOCHI.AI. Thus, the network has three inputs, so that the anchor, the positive and the negative are all fed into the network.
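Here is a toy sketch of that idea in plain Python, not the Keras code from the linked page. It assumes the embedding network is a single linear layer, and all names (`embed`, `triplet_forward`, `loss_from_output`) and the 0.2 margin are made up for illustration. The point is only that one set of weights is applied to all three inputs, and that the loss then slices the concatenated output back into three embeddings.

```python
def embed(weights, image):
    """Shared embedding network: a single linear layer on a flattened image."""
    return [sum(w * x for w, x in zip(row, image)) for row in weights]


def triplet_forward(weights, anchor, positive, negative):
    """Apply the SAME weights to all three inputs and concatenate the outputs."""
    return (embed(weights, anchor)
            + embed(weights, positive)
            + embed(weights, negative))


def loss_from_output(output, dim, margin=0.2):
    """Split the concatenated output into three embeddings and compute the loss."""
    a, p, n = output[:dim], output[dim:2 * dim], output[2 * dim:]

    def sq(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    return max(sq(a, p) - sq(a, n) + margin, 0.0)


weights = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]  # 3 input values -> 2-d embedding
anchor = [1.0, 2.0, 0.0]
positive = [1.0, 2.0, 0.2]
negative = [3.0, 0.0, 1.0]

output = triplet_forward(weights, anchor, positive, negative)
loss = loss_from_output(output, dim=2)
```

Because the three branches share weights, there is still only one set of parameters to update, even though each training example consists of three images.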

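Thanks for posting this. I came looking with the exact same question after watching the lecture. From the webpage you’ve linked, it seems you instantiate the model once, and then for each training step the anchor, the positive and the negative each pass forward through the same shared weights, followed by a single backward pass on the combined triplet loss.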