# What's the relationship between one-shot learning and triplet loss?

In the one-shot learning slide, it says face recognition systems only need one training example to learn the function. Then, when it discusses the triplet loss function, it says we still need multiple training examples of the same person to train the Siamese network. So what’s the relationship between one-shot learning and the Siamese network in face recognition?

Hello @Martinmin

I think the first two minutes of the video “One Shot Learning” repeat, over and over again, the idea that “traditionally, deep learning algorithms don’t work well if you have only one training example”.

This means that if each of our classes has only one training example and we build our NN to classify, it is not going to work well. However, if instead of building the NN to classify, we build it to predict two images’ similarity, then the situation is different.

Basically, it changes from having 100 classes (for 100 persons) to just 2 categories (same person, or not the same person), which significantly reduces the number of photos we need per person. However, do we need only one photo for each person in our training set? No, because we need at least two photos from the same person to train the model to recognize them as the same person. Maybe we won’t need that for every person, but at least for some people.
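To make the “same person, or not the same person” framing concrete, here is a minimal numpy sketch of judging similarity by a distance between embeddings. The 128-dimensional embeddings, the noise scale, and the name `d` are illustrative assumptions, not the course’s exact code:

```python
import numpy as np

def d(embedding_1, embedding_2):
    """Squared L2 distance between two face embeddings.

    Small distance -> likely the same person; large -> different people.
    """
    return float(np.sum((embedding_1 - embedding_2) ** 2))

# Hypothetical 128-dimensional embeddings produced by the same network
rng = np.random.default_rng(0)
anchor = rng.normal(size=128)
same_person = anchor + rng.normal(scale=0.05, size=128)  # a nearby point
other_person = rng.normal(size=128)                      # an unrelated point

assert d(anchor, same_person) < d(anchor, other_person)
```

The network only ever has to answer “how far apart are these two images?”, so it never needs many photos per class the way a 100-way classifier would.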

The above slide clearly indicates that we have samples x^{(i)} and x^{(j)} that are from the same person, right?

As for the triplet loss, as the lecture said:

But for your training set, you do need to make sure you have multiple images of the same person, at least for some people in your training set, so that you can have pairs of anchor and positive images.

The idea is the same here: we need at least two photos from the same person to form a pair of anchor and positive images. Two is the minimum, but since the lecture suggested multiple images, it is likely that just two won’t be enough!
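As a sketch, the triplet loss described above can be written directly from its formula, max(‖f(A)−f(P)‖² − ‖f(A)−f(N)‖² + α, 0). This numpy version is illustrative only; α = 0.2 is a commonly used margin, not a value fixed by the lecture:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """max(||A - P||^2 - ||A - N||^2 + alpha, 0) for one triplet of embeddings."""
    pos_dist = float(np.sum((anchor - positive) ** 2))  # distance to the same person
    neg_dist = float(np.sum((anchor - negative) ** 2))  # distance to a different person
    return max(pos_dist - neg_dist + alpha, 0.0)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # positive: close to the anchor
n = np.array([1.0, 0.0])   # negative: far from the anchor
print(triplet_loss(a, p, n))  # 0.0 -- the margin is already satisfied
```

Notice that the anchor and positive must come from the same person, which is exactly why at least some people need two or more photos in the training set.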

Raymond


@rmwkwok So one-shot learning here isn’t in the strict sense, i.e., exactly one training example. Instead, it means a few examples are necessary. But then we also have few-shot learning, so I still don’t quite understand why one-shot learning is used here. Based on the lecture, when it talks about one-shot learning, the scenario is that there are 5 images stored in a database, and when a new image comes in, the system needs to judge whether this new image matches one of the 5 images. In this case, there is indeed only one image in the database (training set) per person. Normally, a FaceID database won’t store duplicate images of the same person.

So it seems to me that ‘one-shot learning’ here refers to another learning system that’s totally different and independent from the Siamese similarity learning later in the lecture, which aims to learn a similarity function. However, throughout the lectures, there is no mention of one-shot learning except in Video 2.

Let’s try to understand this term from the context of the lecture video, but if you can provide other context (with examples or discussions), we can also discuss it.

Now, the video. I will quote from it:

One of the challenges of face recognition is that you need to solve the one-shot learning problem. What that means is that for most face recognition applications you need to be able to recognize a person given just one single image, or given just one example of that person’s face.

And in contrast, if someone not in your database shows up, as you use the function d to make all of these pairwise comparisons, hopefully d will output a very large number for all four pairwise comparisons. And then you say that this is not any one of the four persons in the database. Notice how this allows you to solve the one-shot learning problem.

These quotes tell you the one-shot learning problem the lecture is trying to solve. If you have a different one-shot learning problem that is not addressed here, you would need to search for other references.

Obviously, the Siamese network is a choice of the d function we can use for similarity comparison, and in order to train it to learn what is similar and what is not, we need photos from the same person and photos from different persons. Does this mean the system will collapse if one of the persons has only one photo in our system? I think not. However, what percentage of persons can have just one photo for the model to still train well? I don’t know the answer without trying it out, but I am pretty sure it is not 0%.

After we have trained the model, we can feed a photo from a person never seen in our training set, and hopefully the model will still give a correct prediction!
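The prediction step is just one pairwise comparison per database entry. Here is a small sketch of that loop; the names, the 0.7 threshold, and the dict layout are assumptions for illustration:

```python
import numpy as np

def recognize(query_emb, database, threshold=0.7):
    """Return the name of the closest stored person, or None if nobody is close enough.

    database maps each name to one stored embedding (the one-shot case).
    """
    best_name, best_dist = None, float("inf")
    for name, emb in database.items():
        dist = float(np.sum((query_emb - emb) ** 2))
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < threshold else None

db = {"Danielle": np.array([1.0, 0.0]), "Jackie": np.array([0.0, 1.0])}
print(recognize(np.array([0.9, 0.1]), db))  # Danielle
print(recognize(np.array([5.0, 5.0]), db))  # None -- not in the database
```

This is why the approach generalizes to people never seen in training: adding a new person only requires storing one embedding, not retraining the network.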

Raymond


What that means is that for most face recognition applications you need to be able to recognize a person given just one single image, or given just one example of that person’s face.

In this quote, “given just one single image”, is this single image used for training or testing? If it is for training, then is this training the same as the Siamese training?

Also on this slide, does the training here refer to one-shot training or siamese training?

Obviously this is for prediction, or for testing, but why can’t that photo be used for training?

That slide comes from the One Shot Learning video. That video said we needed a similarity function d, and at the end the lecture suggested that the Siamese Network could serve as d.

From our discussion so far, I found two questions for you as well:

1. The lecture uses the Siamese Network to solve the One Shot Learning problem; what do you think their relationship is? Do you think the Siamese Network is an alternative to the One Shot Learning problem, or a way to solve the One Shot Learning problem? (The latter is what I think.)

2. Do you think that solving a one-shot learning problem forbids using more than one photo from the same person? If so, where did you hear this?
If you use the Siamese Network, we need pairs of similar examples - pairs of photos from the same person. Given this, do you think we can’t have some people contributing just one photo each? We can have those too, because we also need pairs of not-similar examples. I can use these three photos to train:

1. person A
2. person A
3. person B
I can use photos 1 & 2 to form a similar pair, and (1, 3), (2, 3) as the not-similar pairs. I have only one photo from person B.
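The three-photo example above can be turned into pairs mechanically. A small illustrative sketch, with the labels “A”/“B” standing in for the two people:

```python
from itertools import combinations

def make_pairs(labels):
    """All photo pairs, split into similar (same label) and not-similar (different label)."""
    similar, not_similar = [], []
    for i, j in combinations(range(len(labels)), 2):
        (similar if labels[i] == labels[j] else not_similar).append((i, j))
    return similar, not_similar

# Photos 0 and 1 are person A; photo 2 is person B
similar, not_similar = make_pairs(["A", "A", "B"])
# similar == [(0, 1)], not_similar == [(0, 2), (1, 2)]
```

So person B’s single photo is not wasted: it still contributes to the not-similar pairs.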

For question 1, Siamese is a way to solve the one-shot learning problem by learning a similarity (distance) function d.
For question 2, if most training requires only two or very few examples for the same object, or only one example for a negative object, then it makes a difference; otherwise, if it still requires a lot of examples per class (object), then how is it different from normal supervised learning? More specifically, if I have 1000 training examples for 800 different faces, so that fewer than 50% of faces have more than one photo, this might still work. In a strict sense, if the 1000 training examples were for 1000 different faces, that would be true one-shot learning. I still find this term confusing, given that we also have zero- and few-shot learning.

So I guess the data for training a Siamese network in a face recognition system may have two characteristics:

1. The number of classes is big, typically hundreds, thousands, or even tens of thousands. Given so many classes, a typical classification algorithm won’t work well. ImageNet has 1000 classes, and that count of class labels is already big, but it has over 1 million examples.
2. For the same class, the number of examples is extremely low, usually 1 or 2 or just a few.

If that’s true, why not just use the term few-shot learning? I searched one-shot learning on Google, and all the examples I saw were for face recognition, so I am still unclear about the precise meaning of one-shot here. The author used the term one-shot, not few-shot, and there must be a valid reason for that.

For your understanding of zero-shot learning, do you provide exactly zero data samples to train a network? i.e. you don’t train any network?

If you say you will provide samples, then I will ask you how many samples are acceptable for each class or each person.

I believe you will do some research, and when you do, I suggest you focus on whether we use zero samples in the whole process, or only in some very particular step we do not rely on any sample. The same applies to the one-shot learning we are discussing here.

Another point is that it is NOT the case that you cannot use an algorithm that solves a one-shot learning problem on a multi-shot learning problem. You can use the same algorithm on both problems. However, the question is: is the algorithm good for both problems?

Also, do you actually have a one-shot learning problem, or a multi-shot learning problem? In the lecture’s case, it is assumed to be a one-shot learning problem because it said:

Now let’s say someone shows up at the office and they want to be let through the turnstile. What the system has to do is, despite ever having seen only one image of Danielle, to recognize that this is actually the same person. And, in contrast, if it sees someone that’s not in this database, then it should recognize that this is not any of the four persons in the database

Now this defines it as a one-shot learning problem. And when solving a one-shot learning problem, does it mean that we HAVE TO use just one image from each person? We don’t! We can use whatever we have in our hands to solve this problem. The only constraint is that the system has seen Danielle just once. Just Danielle.

It is the one-shot learning problem. Not a one-shot learning neural network. It is the Siamese network that we use to solve the one-shot learning problem. It is not that the Siamese network is a one-shot learning neural network.

According to Wikipedia, zero-shot learning is “a problem setup in machine learning where, at test time, a learner observes samples from classes which were not observed during training, and needs to predict the class that they belong to”. “For example, given a set of images of animals to be classified, along with auxiliary textual descriptions of what animals look like, an artificial intelligence model which has been trained to recognize horses, but has never been given a zebra, can still recognize a zebra when it also knows that zebras look like striped horses.” So by this definition, it indeed says no zebras are observed in the training data at all.

Regarding “What the system has to do is, despite ever having seen only one image of Danielle, to recognize that this is actually the same person”, it seems one-shot learning is defined based on test time?

“ever having seen only one image”: does this “only one” image refer to the photo being taken when someone approaches the turnstile? If so, then at test time there is always only one instance of the class being presented, for whatever algorithm?

@rmwkwok In NLP there are more use cases of one-shot learning, and I will research that and hopefully get a better understanding of this concept.

Thanks for the detailed explanation!

You are welcome, @Martinmin!

You have shared a great example which shows that we are not absolutely bound to zero samples in a zero-shot learning problem:

You quoted: “an artificial intelligence model which has been trained to recognize horses”
↑↑ My interpretation is that the model was trained with as many horse photos as we had, not bounded to one or zero or any special number. Again, not ZERO.

You quoted: “but has never been given a zebra”
↑↑ The model had never been trained with a single photo of a zebra. It is the zebra, not the horses.

Now, back to our lecture’s one-shot learning problem example:

It could be trained with as many photos as we had of Kerry, Sammie, Jamie, Tracy, and Jackie. They can be 10 photos per person, or 15 per person. But it had only been trained with one photo of Danielle. Not two photos, not three, but one.

I think it is good to call it a “one-shot learning problem”. It might seem to be defined based on test time, but one could also object, “but the one shot was provided for training”, or “but the one photo of Danielle was provided for training”. If we say it is a “one-shot learning problem”, then things look clear: it means “a problem where a certain (not every) person has been seen by the model just once, but we now need to make very good predictions about that person”.

Let’s make some comparisons again:

Zero-shot learning problem: a problem where the model has never seen a zebra, but has seen as many horses as we had, and can still make great predictions on many (not zero, not one) zebra photos.

One-shot learning problem: a problem where the model has seen one and only one photo of Danielle, but has to make great predictions on various (not zero, not one) photos of Danielle, since they can look a bit different every day.

At training time, a zero-shot learning problem is one in which the model has seen a zebra zero times; at test time, a well-resolved zero-shot learning problem requires the model to make great predictions about zebras.

At training time, a one-shot learning problem is one in which the model has seen Danielle once; at test time, a well-resolved one-shot learning problem requires the model to make great predictions about Danielle.

Now, the Siamese network is our tool, our method, our approach to solve a problem. The lecture used it to solve the one-shot learning problem presented in the lecture. That’s it! That’s all!

In the Siamese network, I am going to train it with 10 photos of Jackie, 10 photos of Sammie, blah blah blah. Except that I have only one photo of Danielle. This is the one-shot learning PROBLEM that was presented in the lecture. By extension, the problem can include any new staff besides Danielle, as long as we have exactly one photo of those new staff in my training records.

One last time: the PROBLEM is that the model has seen only one photo of Danielle, NOT at most one photo of every single person in the world.

Cheers,
Raymond

@rmwkwok I think by now I am convinced that zero- or one-shot refers to the number of examples in the training data. Then I have a logical follow-up question. In a dataset of 1000 training examples, what might be the proper ratio of one-shot examples? The following are two possibilities:

• 800 images belong to persons who each have at least 2 images, and the remaining 200 images are one per person
• 200 images belong to persons who each have at least 2 images, and the remaining 800 images are one per person

In reality, which of these training-data ratios do you think is more realistic?

In the first case, one-shot specifically refers to the 200 images, which the algorithm only sees once during training, and in the second case, one-shot refers to the 800 images, which are seen only once during training.

I am thinking that, to make one-shot work well, perhaps out of 1000 examples at least 50% should belong to persons with more than one photo.

Hello @Martinmin,

I believe anyone who has devoted themselves to learning will know that it is sometimes super difficult to change one’s understanding. Sometimes it takes a lot of critical thinking and (extreme) examples to convince ourselves. This is not easy. So, great work, great work! I would also suggest we stay open to new exceptions. If you come across any exceptional examples in the NLP courses that you mentioned, don’t hesitate to share them here and we can analyze them together.

Here is how I will analyze your two possibilities:

1. The ratio is related to the method we use to solve the one-shot learning problem. Let’s assume it is the Siamese network. (We can’t ignore the network, right? After all, the data is fed into the network.)

2. The Siamese network needs to learn what is similar and what is different

3. From your possibility 1, if 400 people each have 2 photos and 200 people each have 1 photo, then I can form 400 similar pairs and 179700 not-similar pairs (counting one not-similar pair per pair of distinct people)

4. From your possibility 2, if 100 people each have 2 photos and 800 people each have 1 photo, then I can form 100 similar pairs and 404550 not-similar pairs.

5. Based on the above, I would say possibility 1 is better because it has more similar pairs.

Now, you may not like that I have changed your notation from “one-shot ratio” to “similar-to-not-similar ratio”, but this is the most relevant way to analyze it taking the Siamese network into account, agreed? We probably shouldn’t use a less relevant or an irrelevant way to look at it, right?

Let’s continue:

1. If X people each have 2 photos and Y people each have 1 photo, then I can form X similar pairs and \frac{(X+Y) \times (X+Y-1)}{2} not-similar pairs. Obviously, the two counts can never be equal, meaning we can never find a one-shot ratio X:Y such that the numbers of both kinds of pairs are equal; the not-similar pairs are always way more.

2. Therefore, I think the model will be more hungry for similar pairs. So we want many non-one-shot samples - enough to train a good Siamese network.
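The counts above can be checked with a few lines. This sketch follows the same counting convention as the analysis - one similar pair per 2-photo person, and one not-similar pair per pair of distinct people:

```python
def pair_counts(x, y):
    """x people with 2 photos each, y people with 1 photo each.

    Similar pairs: one per 2-photo person.
    Not-similar pairs: one per pair of distinct people.
    """
    people = x + y
    return x, people * (people - 1) // 2

assert pair_counts(400, 200) == (400, 179700)  # possibility 1
assert pair_counts(100, 800) == (100, 404550)  # possibility 2
```

The quadratic growth of the second count is why the not-similar pairs always dwarf the similar ones.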

Lastly, since a one-shot learning problem is a problem, it doesn’t quite make sense to actively collect one-shot samples. It is a problem, right? It comes as a requirement or a problem, not as a resource. We need enough similar pairs to make the model capable of addressing the one-shot problem, instead of happily collecting many one-shot samples. At the end of the day, we need enough similar pairs to train the Siamese network well.

One-shot is a problem, and the solution is not necessarily to collect more problems. Focus on the network: how to train a network well?

Happily collecting one-shot samples will not give us a model. Once again, let’s go back to the basics: how to train a network well?

Raymond

@rmwkwok So, ideally, one-shot examples should be regarded more as outliers than as the ‘norm’, to make the network learn better. When creating a face recognition system, ideally employees should be asked to provide at least two photos. For some exceptional cases, and especially for new employees, real one-shot learning then comes in to solve the problem.

@Martinmin

Ok! That is one ideal, and I have no objection to that.

If I recognize the one-shot learning problem as a problem, as if it were a problem that my client asked me to solve, then:

1. One-shot examples are not outliers. Outliers sound like something I want to get rid of. But I can’t get rid of them, because this is what my client paid me for.

2. One-shot examples are the problems I need to deal with. They are not outliers.

3. One-shot examples can make the network learn better. Again, the Siamese network needs not-similar pairs. I can use one-shot examples to form not-similar pairs, even though I probably already have more than enough not-similar pairs.

4. Since I am tackling a one-shot problem, I must add those one-shot samples, though I wouldn’t be too motivated to look for more one-shot samples than the scope set by my client requires. I need to take care of my model’s performance; flooding my model with one-shot samples might not be a good idea.

5. I wouldn’t ask employees to provide at least two photos, because, again, this is why my client paid me. I should address the challenge, not kill the challenge.

The above is like role play, but I think it is also the attitude needed to understand it as a problem. A one-shot learning problem usually comes with limitations which make it both one-shot and a problem. If we could remove those limitations, it would not have been a problem in the first place. We probably need to recognize and accept it as a problem, take on the challenge, and figure out how to solve it.

Cheers,
Raymond

Yes, I agree. Good summary. Face the reality and solve the problem with an appropriate solution.