How is the distance threshold of 0.7 chosen for the face verification problem in Week 4 of the Convolutional Neural Networks course?

Hi all,

I have recently developed a face recognition app. In this application, I have to set a distance threshold to decide whether a pair of face images is similar or dissimilar. However, I don’t know how the value 0.7 was chosen in the Jupyter notebook.
Can anyone please help me with this?

Many thanks,
Yen


I think it means there is a 70% difference in the probabilities for the two images, which seems a reasonable value to pick. Anything above 50% (this is based on the sigmoid function) would yield a true outcome, else false!

To be on the safe side, they chose 70% as the cutoff for outputting True.

They don’t really explain in the notebook how they came up with 0.7 as the comparison threshold for two faces matching. Note that those values are not probabilities: they are not the output of sigmoid. The value is the Euclidean norm of the difference between the two face encoding vectors, which are 128-element vectors normalized to have length 1. If you have two vectors of length 1, the Euclidean length of the difference between them will be a number between 0 and 2. How you assign a “close enough” value is not obvious to me on general principles. My assumption is that they did some experimentation, comparing the value on multiple different images of the same face to see what the range of those values typically is.
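One way to see that range, using only the fact that both encodings have norm 1: expanding the squared distance gives

||v - w||^2 = ||v||^2 - 2 * v \cdot w + ||w||^2 = 2 - 2 * v \cdot w

Since v \cdot w lies between -1 (opposite vectors) and 1 (identical vectors) for unit vectors, the squared distance lies between 0 and 4, and hence the distance itself lies between 0 and 2.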

Of course I’m just theorizing here, so take the above with an appropriate dose of salt. It’s been several years since I listened to the relevant lectures from Prof Ng on this subject, so I don’t remember whether he discusses this point anywhere. If you’ve viewed the lectures recently and remember whether he did, please give us the video name and time offset.

But if you are building an app like this, you must have a training dataset. You can run the experiment yourself by selecting several cases in which the training set contains multiple images of the same person’s face. Take the set of encodings for the same face and compute the maximum distance between any two of those vectors. Then do the same with pairs of images from two different people and compute the minimum of those distances. What we have to hope is that there is a clear gap between the maximum distance for the same person and the minimum distance between two different people. If that turns out not to be true, then I’d argue our algorithm is flawed and we need to go back to the drawing board. :scream_cat:
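If you want to run that check, here’s a rough sketch of the experiment (the function name and the dict-of-encoding-lists structure are just assumptions for illustration, not from the notebook):

import itertools
import numpy as np

def same_vs_different_gap(encodings_by_person):
    # encodings_by_person: dict mapping name -> list of that person's
    # 128-d unit-length encodings (each person needs at least 2 images)
    # Max distance between two images of the same person
    max_same = max(
        np.linalg.norm(a - b)
        for encs in encodings_by_person.values()
        for a, b in itertools.combinations(encs, 2)
    )
    # Min distance between images of two different people
    min_diff = min(
        np.linalg.norm(a - b)
        for (_, e1), (_, e2) in itertools.combinations(encodings_by_person.items(), 2)
        for a in e1
        for b in e2
    )
    return max_same, min_diff

If max_same < min_diff, any threshold in that gap (0.7, say) cleanly separates the two cases on your data; if not, that’s the “back to the drawing board” scenario.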


I went back and looked at the actual notebook again and scanned the lecture on Triplet Loss, and there is one point relevant to the question raised here: notice that in the definition of the Triplet Loss function there is a hyperparameter \alpha that Prof Ng calls the “margin” between the positive and negative cases. In both the lecture and the assignment, the value of \alpha is chosen as 0.2. But also notice that in the definition of the function, he uses the squares of the norms of the differences of the embedding vectors. So the goal of the training is to create a gap of at least 0.2 in squared distance between the positive case (two different pictures of the same person) and the negative case (pictures of two different people). It makes sense to use the squared values for training, since that saves a lot of compute. It’s exactly analogous to using MSE instead of Euclidean distance: the square root doesn’t give you any advantage but adds complexity and compute cost.
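For reference, here is a minimal numpy sketch of that per-triplet loss as defined in the lecture (the assignment implements it in TensorFlow; the function name here is just for illustration):

import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    # Squared L2 distances between the embeddings, as in the lecture
    pos_dist = np.sum(np.square(anchor - positive), axis=-1)  # ||f(A) - f(P)||^2
    neg_dist = np.sum(np.square(anchor - negative), axis=-1)  # ||f(A) - f(N)||^2
    # Hinge at the margin: zero loss once the negative is at least
    # alpha farther away (in squared distance) than the positive
    return np.maximum(pos_dist - neg_dist + alpha, 0.0)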

But then when we use the embeddings in practice, we are using the actual norms, not the squares of the norms. If the squared distances differ by at least 0.2, then the negative distance is at least \sqrt{d^2 + 0.2}, where d is the positive distance; in the best case (d = 0) that means the negative distance is \geq \sqrt{0.2} = 0.447... or thereabouts, although the gap between the norms themselves shrinks as d grows.

I didn’t see anywhere that he discusses why 0.2 was chosen as the margin value.

From a practical standpoint, I wrote a little piece of test code to see how the differences play out in the small database that they gave us. Here’s my added cell in the notebook:

# Experiment with distances (np and the database dict come from earlier cells)
younes_vec = img_to_encoding("images/camera_0.jpg", FRmodel)

# Negating a unit vector gives the farthest possible encoding (distance 2)
younes_oppo = -1. * younes_vec
print(f"distance to opposite = {np.linalg.norm(younes_vec - younes_oppo)}")

# Loop over the database dictionary's names and encodings.
for (name, db_enc) in database.items():
    # Euclidean distance between the two 128-d unit encodings
    dist = np.linalg.norm(younes_vec - db_enc)
    # Cosine similarity is just the dot product, since both have norm 1
    cos_sim = np.squeeze(np.dot(younes_vec, db_enc.T))
    print(f"younes to {name}: dist {dist} cos_sim {cos_sim}")

Running that gives the following output:

distance to opposite = 1.9999998807907104
younes to danielle: dist 1.2834293842315674 cos_sim 0.17640449106693268
younes to younes: dist 0.599294900894165 cos_sim 0.8204227685928345
younes to tian: dist 1.430235743522644 cos_sim -0.022787034511566162
younes to andrew: dist 1.368172287940979 cos_sim 0.06405222415924072
younes to kian: dist 1.3116600513458252 cos_sim 0.13977383077144623
younes to dan: dist 1.3604931831359863 cos_sim 0.07452907413244247
younes to sebastiano: dist 1.377026081085205 cos_sim 0.051899492740631104
younes to bertrand: dist 1.4408819675445557 cos_sim -0.038070425391197205
younes to kevin: dist 1.2082229852676392 cos_sim 0.27009856700897217
younes to felix: dist 1.3881206512451172 cos_sim 0.0365605354309082
younes to benoit: dist 1.4173320531845093 cos_sim -0.004415145143866539
younes to arnaud: dist 1.3324687480926514 cos_sim 0.11226345598697662

The salient points there are that the distance between the two different pictures of Younes is about 0.6 and the minimum of the distance to any of the different people is about 1.2. So at least in this one very limited case, you can see there is a workable gap between the “yes” and “no” answers and that 0.7 is a good threshold value to make the distinction.
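In code terms, the verification decision is then just a comparison of that distance against the threshold, roughly:

dist = np.linalg.norm(img_to_encoding(image_path, FRmodel) - database[identity])
is_same_person = dist < 0.7  # 0.599 for younes vs. younes, >= 1.208 for everyone else

(If I recall correctly, that is essentially what the notebook’s verify function does with the 0.7 threshold.)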

One other note is that it occurred to me to wonder why they didn’t use the cosine similarity between the embedding vectors instead of the Euclidean distance between them. I don’t remember Prof Ng commenting on that either. Later in Course 5 when we use word embeddings, they frequently use cosine similarity rather than the vector differences. You can see that there is also a very clear gap between the “yes” case (cos \approx 0.82) and all the other “no” cases (max(cos) \approx 0.27). Of course a higher value is better in that case, meaning that the vectors point closer to the same direction. My guess is that the reason for using Euclidean distance rather than cosine similarity is that it’s cheaper to compute. So if it works (and apparently it does in the face recognition case), that would be preferable. Maybe in the later word embedding cases in C5 they discovered that Euclidean distance is not sufficient to drive the training.


Actually on further reflection, it occurs to me that this statement doesn’t make sense. In the usual case (as we see here) the embedding vectors are normalized to have length one, so computing the cosine similarity between two vectors is actually cheaper! Remember the formula is derived from this mathematical relationship:

v \cdot w = ||v|| * ||w|| * cos(\theta)

where \theta is the subtended angle between the two vectors. So if the vectors both have norm 1, then it’s just the dot product:

cos(\theta) = v \cdot w

That is cheaper than taking the difference of the vectors, squaring the elements, summing them, and then taking the square root.
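As a toy illustration of that cost comparison (random unit vectors standing in for the face encodings):

import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=128)
v /= np.linalg.norm(v)  # normalize to unit length, like the encodings
w = rng.normal(size=128)
w /= np.linalg.norm(w)

cos_sim = np.dot(v, w)        # one dot product: 128 multiplies and adds
dist = np.linalg.norm(v - w)  # subtract, square, sum, then a square root

# For unit vectors the two are directly related: dist^2 = 2 - 2 * cos_sim
print(dist, np.sqrt(2.0 - 2.0 * cos_sim))  # prints the same value twice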

So we are left with a bit of a mystery. At least in the one very limited experiment I was able to show above, it looks like either method provides a clear demarcation between the “yes” and “no” answers. Given that, it looks like either method would be workable. Of course my test case is very limited and maybe a more realistic data sample would show why the Euclidean distance method is preferable in this case, even though it’s not actually computationally cheaper.

More thought and research required. :nerd_face: