In the lectures, Andrew Ng defines the triplet loss by taking the difference between the output vectors, then calculating the L2 norm, and squaring that.
Why did we just take the simple difference instead of calculating, say, Euclidean distance?
The “squared norm” is computationally easier, since you don’t have to compute the square root.
It’s not the “simple difference”. The 2-norm is the Euclidean length of a vector. So what he shows is the square of the Euclidean length of the difference vector. The point is that Prof Ng is just showing the mathematical formula there. You would not write the code to compute the 2-norm and then square it, because (as Tom points out) you’d be wasting the computation needed to compute the square root (relatively expensive) and then squaring the result (relatively cheap). You would just compute the sum of the squares of the differences, which is the first step in computing it the long way. But that gives you the answer as Prof Ng has specified it above.
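To make the point concrete, here is a small NumPy sketch (the embedding values are made up for illustration) showing that squaring the 2-norm of the difference gives exactly the sum of squared differences, so the square root is wasted work:

```python
import numpy as np

# Two hypothetical face embeddings (any equal-length vectors work)
anchor = np.array([0.1, 0.4, -0.3])
positive = np.array([0.2, 0.5, -0.1])

# The "long way": Euclidean (2-)norm of the difference vector, then squared
long_way = np.linalg.norm(anchor - positive) ** 2

# The direct way: sum of squared differences, skipping the square root
direct = np.sum((anchor - positive) ** 2)

print(np.isclose(long_way, direct))  # the two agree up to floating-point error
```

In practice you would just write the `direct` version in your loss function; the formula with the squared norm is the mathematical statement of the same quantity.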
Thank you for your answers both Paul and Tom. I guess I am wondering why we calculate the similarity of the two vectors as such, when we could use something like cosine similarity as well.
It’s a good question, but I don’t know the answer. Of course the vectors we are comparing are “embeddings” in the sense of semantic embedding: unit vectors in a space in which the dimensions represent the strength of various (learned) attributes of a face. So just using the Euclidean distance seems reasonable as a method of measuring the similarity of two embeddings. But this is an experimental science, right? You can take the model and switch to using cosine similarity as your cost function and then compare the results. If your version works better, write it up!
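One observation that may partly defuse the question: since these embeddings are normalized to unit length, squared Euclidean distance and cosine similarity are directly related by the identity ‖a − b‖² = 2 − 2·cos(a, b), so ranking pairs by one is equivalent to ranking by the other. A quick sketch with made-up unit vectors:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Two hypothetical unit-length embeddings
a = np.array([0.6, 0.8])
b = np.array([0.8, 0.6])

sq_dist = np.sum((a - b) ** 2)
cos_sim = cosine_similarity(a, b)

# For unit vectors, squared Euclidean distance = 2 - 2 * cosine similarity
print(np.isclose(sq_dist, 2 - 2 * cos_sim))
```

So for unit-normalized embeddings the choice between the two is more a matter of convention and optimization convenience than of measuring something fundamentally different.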
Or before you invest that effort, the simpler approach would be to read some of the references that are given at the end of this assignment. Maybe they comment in the papers about which distance functions they considered and why they made the choices that they used.
Yes, I considered both those approaches - I even looked into the FaceNet paper, and saw that they reference another paper for the determination of their distance function (which was way over my head).
Thanks for clarifying - I feel satisfied now.
Nathan