Triplet Loss Backpropagation


So, I understand the Triplet loss function and what it is aiming to do. Consider that I want to implement a Siamese Network completely from scratch. By completely, I mean no TensorFlow or PyTorch.

In Triplet Loss, we have a total of 3 predictions involved: f(A), f(P) and f(N). Similar to a normal loss function, which takes the true labels and the predicted labels, we need to calculate the derivatives of the loss with respect to each prediction, i.e. dL/dfA, dL/dfP and dL/dfN.
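For concreteness, here is a minimal NumPy sketch of those three gradients for the standard hinge form of the loss, L = max(0, ||f(A) − f(P)||² − ||f(A) − f(N)||² + α) (the function name and the margin value are my own choices, not from any library):

```python
import numpy as np

def triplet_loss_grads(fA, fP, fN, alpha=0.2):
    """Triplet loss and its gradients w.r.t. the three embeddings.

    L = max(0, ||fA - fP||^2 - ||fA - fN||^2 + alpha)
    """
    loss = np.sum((fA - fP) ** 2) - np.sum((fA - fN) ** 2) + alpha
    if loss <= 0:
        # Margin satisfied: the hinge is flat, so all gradients are zero.
        z = np.zeros_like(fA)
        return 0.0, z, z, z
    dA = 2 * (fN - fP)    # dL/dfA = 2(fA - fP) - 2(fA - fN)
    dP = -2 * (fA - fP)   # dL/dfP
    dN = 2 * (fA - fN)    # dL/dfN
    return loss, dA, dP, dN
```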

As I understand it, we have only one neural network with a (say) 128-dimensional dense layer as the final output. My question is: how are the three gradients backpropagated to this single layer? A Dense layer has a single output and hence expects a single gradient to be backpropagated. How do I handle these three gradients?

Note: I’ve looked around and have been unable to find anything concrete. Most of the resources just say “it is backpropagated.” My question is about the internal mechanics of it all. Also, I know how to calculate the gradients, so no issues there.


The gradients for the Dense layer work the same way they always do. It’s just a linear (affine) transformation followed by an activation function. Of course you need to backpropagate through all the earlier conv layers, pooling layers, batch norm and whatever else there is in the network. The only part that is new here is the loss function. But you have the formula for the loss function, right? And if you know how to take gradients, you surely know the chain rule. You have the function. Go for it!
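That chain-rule step through a dense layer can be sketched in plain NumPy. This is a bare affine map with the activation omitted for brevity; the function name is my own:

```python
import numpy as np

def dense_backward(x, W, dout):
    """Backprop through out = W @ x + b, given upstream gradient dout = dL/dout."""
    dW = np.outer(dout, x)   # dL/dW: outer product of upstream grad and input
    db = dout.copy()         # dL/db: bias gradient is the upstream grad itself
    dx = W.T @ dout          # dL/dx: what gets passed on to the earlier layers
    return dW, db, dx
```

Whether `dout` came from a triplet loss or a plain MSE makes no difference to this layer; it just receives a gradient of the same shape as its output.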

I think I have understood it now. Since it is a Siamese network, there are multiple “branches” of the same base CNN. Each gradient is fed back through the branch that generated the prediction used to calculate it, and because the branches share the same weights, the weight gradients from the three backward passes are summed before the update.
