Hi Sir
I have a small doubt. I understand the intuition behind triplet loss. Say we have trained our model on 1k people and deployed it. Now, if a new person joins our team, do I need to fine-tune the model again by taking anchor, positive, and negative images for that new person and updating the parameters? Because that person is also part of our org, we need to capture their identity and authenticate them.
My second doubt is about setting up Face ID on an iPhone. It captures multiple pictures of my face from different angles. So it basically keeps a straight-on picture of me as the anchor, all the different angles as positives, and someone else's pictures as negatives, and then trains the model? Is that right? And when I try to authenticate, it just computes the similarity?
If a new employee joins your team, you do not have to do any additional training of the model. All you need to do is take a picture of them, encode that picture with your pretrained model, and then create an entry in your database for that person with their name (and perhaps other data) and the encoding of their picture. Then, when they try to open the door, you run the equivalent of the “who is it” algorithm they had us build in the assignment.
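In code, the “enroll a new person” step and the door check are roughly this. This is just a minimal sketch on my part, not the assignment code: the `encode` callable stands in for one forward pass through the pretrained model, and the 0.7 threshold is only an illustrative value.

```python
import numpy as np

# Sketch only: `encode(img)` stands in for a forward pass of the pretrained
# FaceNet-style model, returning a 128-d NumPy vector for a face image.

def enroll(name, img, encode, database):
    """Add a new person with no retraining: just store their encoding."""
    database[name] = encode(img)

def who_is_it(img, encode, database, threshold=0.7):
    """Find the closest stored identity, or None if nobody is close enough."""
    query = encode(img)
    best_name, best_dist = None, float("inf")
    for name, stored in database.items():
        dist = np.linalg.norm(query - stored)   # L2 distance in the embedding space
        if dist < best_dist:
            best_name, best_dist = name, dist
    return (best_name, best_dist) if best_dist <= threshold else (None, best_dist)
```

Note that adding the new employee touches only the `database` dictionary; the model weights are never updated.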
In the case of face recognition on your phone, it is a similar story: they are not retraining any model. They take your picture from different angles and then encode those images with the existing, already trained model. Then, when you try to authenticate, they compute the encoding of your face at that moment and compare it to the database of encodings they built when you activated FR authentication.
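As a sketch of that flow (my own simplification, not Apple's actual pipeline; `encode` again means a forward pass through the frozen, pretrained model):

```python
import numpy as np

def enroll_owner(frames, encode):
    """Encode several angles of the owner's face; the model weights never change."""
    return [encode(f) for f in frames]

def authenticate(current_frame, enrolled_encodings, encode, threshold=0.7):
    """Compare the live encoding against the encodings captured at setup time."""
    live = encode(current_frame)
    dists = [np.linalg.norm(live - e) for e in enrolled_encodings]
    return min(dists) <= threshold   # close enough to at least one enrolled view
```

Keeping several encodings per person and taking the minimum distance is just one reasonable design choice; averaging the enrolled encodings into a single vector would be another.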
The larger point here is what they mention in the explanations in the notebook: creating and training a model like FaceNet requires a large training set and a very large amount of compute. If you have done that successfully, then you have a model that knows how to translate a picture of any face into an encoding in the 128-dimensional “embedding space” that they selected to represent faces.
Ahhh, this makes sense now. To reiterate: once the model is trained, the encodings it gives will be almost the same for different images of the same person. Hence, we just need to compute the encoding and store it, and at test time we just compare? And sir, any tips on which encoding we should choose for a given person? In the training set we have multiple images of the same person. Is it advantageous to select a straight-on picture? Intuitively, I guess any picture of a person would be alright, since the model's output encodings will be almost the same for all of them.
We don’t “choose” those, other than defining the number of dimensions in the encoding space. What the encodings actually represent is learned during training, driven by the triplet loss cost function and backpropagation on the training set. At some level we don’t even know what they are and can only speculate, but there are probably ways we could figure it out by running experiments similar to those that Prof Ng describes in the lecture “What are Deep ConvNets Learning” in Week 4 of Course 4.
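For reference, the triplet loss itself is small enough to write down. Here is a NumPy sketch of the formula from the lectures (alpha = 0.2 is just a typical margin value, and the shapes are my assumption):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + alpha, 0), summed over the batch.

    anchor, positive, negative: encodings f(A), f(P), f(N) of shape (batch, 128).
    """
    pos_dist = np.sum(np.square(anchor - positive), axis=-1)   # same-person distance
    neg_dist = np.sum(np.square(anchor - negative), axis=-1)   # different-person distance
    return np.sum(np.maximum(pos_dist - neg_dist + alpha, 0.0))
```

Minimizing this pushes encodings of the same person together and encodings of different people apart, which is exactly why the distance comparisons above work without any per-person training.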
As to the type of pictures that work best, it would depend on what the training set looks like. If they only used nicely centered, portrait-style photos, then that’s your best bet for real-world inputs as well. But I would guess that for the FR models on phones, they used a wider variety of pictures taken with phones, e.g. “selfies” at various angles and in various lighting conditions, since that’s what the real inputs will look like. The larger point here is that as the system designer it’s your job to figure out what your training set needs to look like in order to get results that will be “good enough” for whatever the real-world use case for your model is.