I went back and looked at the actual notebook again and scanned the lecture on Triplet Loss, and there is one point relevant to the question raised here: notice that in the definition of the Triplet Loss function there is a hyperparameter \alpha that Prof Ng calls the “margin” between the positive and negative cases. In both the lecture and the assignment, the value of \alpha is chosen as 0.2. But also notice that in the definition of the function, he uses the squares of the norms of the differences of the embedding vectors. So the goal of the training is to create a gap of at least 0.2 between the squared distances for the positive case (two different pictures of the same person) and the negative case (pictures of two different people). It makes sense to use the squared values for training, since that saves a lot of compute. It’s exactly analogous to using MSE instead of Euclidean distance: the square root doesn’t give you any advantage but adds complexity and compute cost.
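For reference, here is the loss as defined in the lecture, for a triplet of anchor A, positive P, and negative N, where f is the embedding function:

\mathcal{L}(A, P, N) = \max\left(\|f(A) - f(P)\|^2 - \|f(A) - f(N)\|^2 + \alpha,\; 0\right)

Driving this to zero forces \|f(A) - f(N)\|^2 to exceed \|f(A) - f(P)\|^2 by at least \alpha.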
But then when we use the embeddings in practice, we are comparing the actual norms, not the squares of the norms. So if the difference of the squares is at least 0.2, the actual distances should be separated as well, although the gap in the norms themselves is not simply \sqrt{0.2} = 0.447...; it depends on how large the distances themselves are.
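Spelling out how the squared margin carries over to the raw distances: writing d_{pos} and d_{neg} for the anchor’s distances to the positive and negative images, the identity a^2 - b^2 = (a + b)(a - b) gives

d_{neg} - d_{pos} = \frac{d_{neg}^2 - d_{pos}^2}{d_{neg} + d_{pos}} \geq \frac{\alpha}{d_{neg} + d_{pos}}

so the gap in the raw distances shrinks as the distances themselves grow. Since these encodings are unit-normalized (the “distance to opposite = 2” line below confirms it), each distance is at most 2, so some separation is still guaranteed.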
I didn’t see anyplace where he discusses why 0.2 was chosen as the margin value.
From a practical standpoint, I wrote a little piece of test code to see how the differences play out in the small database that they gave us. Here’s my added cell in the notebook:
# Experiment with distances and cosine similarities against the database
import numpy as np   # already imported in the notebook; repeated here so the cell stands alone

# Encoding of the "camera_0" picture of Younes, plus its negation as a sanity check
younes_vec = img_to_encoding("images/camera_0.jpg", FRmodel)
younes_oppo = -1. * younes_vec
print(f"distance to opposite = {np.linalg.norm(younes_vec - younes_oppo)}")

# Loop over the database dictionary's names and encodings, comparing each
# stored encoding to the Younes encoding with both metrics.
for (name, db_enc) in database.items():
    dist = np.linalg.norm(younes_vec - db_enc)
    cos_sim = np.squeeze(np.dot(younes_vec, db_enc.T))
    print(f"younes to {name}: dist {dist} cos_sim {cos_sim}")
Running that gives the following output:
distance to opposite = 1.9999998807907104
younes to danielle: dist 1.2834293842315674 cos_sim 0.17640449106693268
younes to younes: dist 0.599294900894165 cos_sim 0.8204227685928345
younes to tian: dist 1.430235743522644 cos_sim -0.022787034511566162
younes to andrew: dist 1.368172287940979 cos_sim 0.06405222415924072
younes to kian: dist 1.3116600513458252 cos_sim 0.13977383077144623
younes to dan: dist 1.3604931831359863 cos_sim 0.07452907413244247
younes to sebastiano: dist 1.377026081085205 cos_sim 0.051899492740631104
younes to bertrand: dist 1.4408819675445557 cos_sim -0.038070425391197205
younes to kevin: dist 1.2082229852676392 cos_sim 0.27009856700897217
younes to felix: dist 1.3881206512451172 cos_sim 0.0365605354309082
younes to benoit: dist 1.4173320531845093 cos_sim -0.004415145143866539
younes to arnaud: dist 1.3324687480926514 cos_sim 0.11226345598697662
The salient points are that the distance between the two different pictures of Younes is about 0.6, while the minimum distance to any of the other people is about 1.2 (Kevin). So at least in this one very limited case, there is a workable gap between the “yes” and “no” answers, and 0.7 is a good threshold value to make the distinction.
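For concreteness, here is a minimal sketch of how that threshold gets used in a verification check, in the spirit of the assignment’s verify() logic; it assumes the notebook’s img_to_encoding(), FRmodel, and database are in scope, and the is_same_person wrapper name is just my own for illustration:

# Threshold-based verification sketch: "is this image the claimed identity?"
# Assumes the notebook's img_to_encoding(), FRmodel, and database are in scope.
def is_same_person(image_path, identity, database, model, threshold=0.7):
    # Encode the new image and compare it to the stored encoding for `identity`
    encoding = img_to_encoding(image_path, model)
    dist = np.linalg.norm(encoding - database[identity])
    return dist < threshold, dist

match, dist = is_same_person("images/camera_0.jpg", "younes", database, FRmodel)
print(f"match = {match}, dist = {dist}")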
One other note: it occurred to me to wonder why they didn’t use the cosine similarity between the embedding vectors instead of the Euclidean distance between them. I don’t remember Prof Ng commenting on that either. Later, in Course 5, when we use word embeddings, they frequently use cosine similarity rather than vector differences. You can see in the output above that there is also a very clear gap between the “yes” case (cos \approx 0.82) and all the “no” cases (max cos \approx 0.27). Of course a higher value is better in that case, meaning the vectors point closer to the same direction. My guess is that the reason for using Euclidean distance rather than cosine similarity is that it’s cheaper to compute. So if it works (and apparently it does in the face recognition case), that would be preferable. Maybe in the later word embedding cases in C5 they found that Euclidean distance was not sufficient to drive the training.
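One related observation of my own (not something from the lectures): because these encodings are unit-normalized, squared Euclidean distance and cosine similarity carry the same information, via the identity \|a - b\|^2 = 2 - 2\,(a \cdot b) when \|a\| = \|b\| = 1, so either metric separates the cases equally well here. A quick check against the younes-to-danielle numbers printed above:

# Check ||a - b||^2 == 2 - 2 * cos_sim for unit-length encodings, using the
# younes-to-danielle values from the printout above.
dist = 1.2834293842315674       # Euclidean distance as printed
cos_sim = 0.17640449106693268   # cosine similarity as printed
print(dist ** 2)                # -> ~1.6472
print(2 - 2 * cos_sim)          # -> ~1.6472, the same value as expected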