Question about one-shot learning and triplet loss

Hi, friends and mentors,

In the one-shot learning video, my understanding is that you have only one picture of a person (for example, a company may have just one picture of you) and still need to do recognition. That is why the Siamese network and triplet loss are introduced.

However, in the middle of the triplet loss video, the professor said:

if you had just one picture of each person, then you can't actually train this system. But of course, after having trained a system, you can then apply it to your one-shot learning problem where for your face recognition system, maybe you have only a single picture of someone you might be trying to recognize. But for your training set, you do need to make sure you have multiple images of the same person, at least for some people in your training set, so that you can have pairs of anchor and positive images.

Well, the triplet loss needs those two pairs (giving d(a,p) and d(a,n)), so for sure it needs tons of data. But what is the meaning of “you can then apply it to your one-shot learning problem where for your face recognition system”? Was he saying that during training I need to provide many pictures of myself, but after training I just use one new picture of myself for testing or in production? Not sure if I made this clear, but I feel like you still need tons of data, so… there is NO one-shot after all. lol. Thank you.
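For concreteness, here is how I understand the triplet loss for one (anchor, positive, negative) triple. This is just a rough sketch, with `alpha` standing for the margin hyperparameter:

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    # f_a, f_p, f_n: embedding vectors for the anchor, positive, and negative images
    d_ap = np.sum((f_a - f_p) ** 2)  # squared distance d(a, p)
    d_an = np.sum((f_a - f_n) ** 2)  # squared distance d(a, n)
    # Hinge: the loss is zero once d(a,p) is at least `alpha` smaller than d(a,n)
    return max(d_ap - d_an + alpha, 0.0)
```

So every training triple needs an anchor and a positive of the *same* person, which is why the training set must contain multiple pictures of at least some people.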

I think the point is that once you have a trained system, you can use it for recognizing or comparing faces that it’s never seen before. Of course, if you are seeing a person for the first time, that means you don’t have a picture of them in your database. But you can still imagine ways that your system could be useful. For example, suppose a person shows up at a kiosk at airport security and presents their passport. You could use the system to verify that the person in front of you matches the passport picture even though you don’t have that person’s picture in your database, right? You take a realtime picture of the person, scan the passport picture, and feed both of them through your trained algorithm. That generates “embedding” vector outputs for both images. Now you can compute the distance between those two vectors (the 2-norm of their difference) and have a threshold value at which you say “yes, this person is the real owner of this passport”.
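In code, that pipeline is just a few lines. Here’s a minimal sketch, assuming a trained `encoder` network and a threshold `tau` tuned on held-out data (both names are illustrative, not from the lecture):

```python
import numpy as np

def verify(encoder, passport_img, live_img, tau=0.7):
    # `encoder` maps an image to an embedding vector (e.g. 128-d, FaceNet-style);
    # `tau` is a distance threshold you would tune on a validation set.
    emb_passport = encoder(passport_img)  # embedding of the scanned passport photo
    emb_live = encoder(live_img)          # embedding of the realtime capture
    dist = np.linalg.norm(emb_passport - emb_live)  # 2-norm of the difference
    return dist < tau  # accept iff the two faces are "close enough"
```

Note that nothing here requires the person to have appeared in the training set: the encoder was trained once on many people, and at verification time it only needs these two images.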

But if you have deeper questions about this, there is a recent ongoing discussion of this general set of issues that’s worth a look.


Hi Paul. I want to double-check my understanding. First, we use tons of data to train this system. Let’s assume the system is well trained on 100 people. Then the 101st person shows up, and he is new (he never appears in the data for those 100 people). Now you give the system two new pictures (one is an official picture, passport style, and the other is an instant face scan). Since the system is well trained (the “encoder” is well built), it will output the distance d(A,P). Is this correct?

In other words, you do need tons of data to train it, but once it’s well trained, any new data (a new person’s face, even just one picture) can be handled correctly later.
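So in terms of your sketch above, it would be something like this (reusing your hypothetical `encoder` and `tau`, with illustrative image names):

```python
import numpy as np

# The 101st person: two images the encoder never saw during training.
d_AP = np.linalg.norm(encoder(official_pic) - encoder(face_scan))
same_person = d_AP < tau  # threshold check, as in the verify() sketch above
```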

Yes, but note that there is nothing in that statement that is special to face recognition. That’s always the way it works with any ML application, right? You have your large training set, then you have your smaller test set to verify the results of the training. Then you deploy your system in the “real world” and, if you’ve trained it well, it gives good predictions on “real” inputs that it’s never seen before. Maybe the one other point here is that “large” in this case is probably way bigger than 100. I didn’t read the papers on FaceNet, but I’ll bet the size of the training set was several orders of magnitude bigger than 10^2. :scream_cat:

The only thing that’s unique here is that you might think the “embedding” or “encoding” vectors you get out of the trained FR system would not be very useful if you don’t have a database entry of what the person looks like a priori. I was just giving an example of a case in which they can still be useful.
