Hi everyone,
I have some confusion about the solution to the one-shot learning problem. Prof. Ng said the problem is that we want to recognise a person given only one image of that person in our database, but later said that when training the Siamese network we need multiple pictures per person. From my understanding, this seems like a contradiction.
Thanks in advance.
Recognition (i.e. using the model after you’ve trained it) has different requirements than training.
Thanks for the clarification, but I understood that we don’t use a normal ConvNet for this task because our training set is small. Or is this small training set the one with 10k images for 1k classes?
Please give a reference to the video you are discussing (video title and a time mark).
One-shot learning, from the 1:25 mark. @TMosh
You have tagged this post with week 4.
But Course 2 (“Improving DNNs…”) only has three weeks.
I’m so sorry, I thought I had tagged it correctly.
Should I delete it and repost it there?
You can change the week tag - use the little “pencil” icon next to the thread title.
Done.
Summary:
Assume we’re limited to having one image in the training set for each person. We cannot use a standard CNN classifier, for two reasons:
- One image is not enough to train a CNN.
- The CNN will output a label for each person, so if you want to add a new person, you have to re-train the entire CNN.
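To illustrate the second bullet: with a plain classifier, the size of the output layer is tied to the number of known people, so enrolling someone new changes the architecture itself. (Keras-style sketch; the layer sizes and input shape here are made up for illustration.)

```python
import tensorflow as tf

num_people = 100  # one output unit per known person

classifier = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(96, 96, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    # Adding person number 101 means this layer (and therefore the whole
    # model) has to be resized and retrained.
    tf.keras.layers.Dense(num_people, activation="softmax"),
])
```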
Instead, we will train a similarity function to identify whether there is a match between an image in the data set and the person standing outside the door. Adding a new person only requires that we add their image to the data set, so we do not need to retrain the similarity function.
However, we do still need a lot of images to use to train the similarity function. It needs to learn the important factors for the generic task of comparing images of faces.
For example, the similarity function may learn that the distance between the eyes, the nose shape, and the location of the corners of the mouth, are good metrics of identity.
It runs this comparison between each image in the data set and the image of the person on camera, and provides a similarity score for each pair.
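As a rough sketch (the encoder `f` below is a stand-in for a trained Siamese embedding network, and the 128-dimensional encoding size is just the value used in the course), the similarity score for a pair of images is simply a distance between their encodings:

```python
import numpy as np

def similarity(f, img_a, img_b):
    """Squared L2 distance between the encodings of two face images.

    f is assumed to be a trained embedding network (e.g. the Siamese
    ConvNet discussed in the lectures) that maps an image to a vector,
    say 128-dimensional. A small distance suggests the same person.
    """
    enc_a = np.asarray(f(img_a))   # encoding of the first image
    enc_b = np.asarray(f(img_b))   # encoding of the second image
    return float(np.sum((enc_a - enc_b) ** 2))
```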
So we can say that, in the case where we have only one image per person in a company, but there is a separate dataset with many images per person (10 images, as mentioned in the notes), we can learn a similarity function on that dataset and then use it in our company (after computing the encoding of every employee’s picture with the same Siamese network used in training).
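Roughly like this sketch (the encoder `f`, the image dictionary, and the 0.7 threshold are placeholders I’m making up for illustration):

```python
import numpy as np

def build_database(f, employee_images):
    """Encode each employee's single photo once with the trained encoder f."""
    return {name: np.asarray(f(img)) for name, img in employee_images.items()}

def who_is_it(f, camera_img, database, threshold=0.7):
    """Return the closest employee, or None if nobody is close enough."""
    enc = np.asarray(f(camera_img))
    best_name, best_dist = None, float("inf")
    for name, db_enc in database.items():
        dist = np.linalg.norm(enc - db_enc)   # distance between encodings
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < threshold else None
```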
Am I right?
This seems correct to me.
Thank you so much.