For the final part of this model, could we replace it with cosine similarity or the L2 distance between two embeddings to measure how similar they are? That way, we could use a vector database's retrieval capability for fast lookup.
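Concretely, this is the kind of comparison I have in mind; a minimal NumPy sketch (the function names are just for illustration):

```python
import numpy as np

def l2_distance(u, v):
    # Euclidean (L2) distance between two embedding vectors
    return np.linalg.norm(u - v)

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors, in [-1, 1]
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
```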
There are now many models for converting images to vectors, especially the image-embedding capabilities provided by large language models. Could we generate the vectors directly with one of those models and then just compare them?
Note that what is being discussed on that slide is how to train the model that computes the embeddings of the face images. We need to choose the distance and cost metric such that the model does a good job of identifying which faces are the same and which are not. Regardless of what we use as the distance or cost function to drive the training, we're not talking about just looking things up in a precomputed embedding database at this point. Professor Ng explains in the lectures why the Triplet Loss Function is the preferred method for training such a model.
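For what it's worth, here is a minimal sketch of what that loss looks like, assuming the embeddings arrive as batched TensorFlow tensors (the margin value alpha = 0.2 is just an illustrative default):

```python
import tensorflow as tf

def triplet_loss(anchor, positive, negative, alpha=0.2):
    # Squared L2 distance between the anchor and the positive (same person)
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    # Squared L2 distance between the anchor and the negative (different person)
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    # Hinge: push same-person pairs at least `alpha` closer than different-person pairs
    return tf.reduce_sum(tf.maximum(pos_dist - neg_dist + alpha, 0.0))
```

The margin is what makes the learned embedding space useful for comparison: the anchor-positive distance is driven to be at least alpha smaller than the anchor-negative distance.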
There are some Face Recognition applications in which you do have a precomputed database of face embeddings, as Professor Ng describes and as we'll see in the assignment for this topic. E.g., the case in which you are implementing secure entry to your office for your employees. But you still need to run the model to compute the embedding of the image from the door cam and then see if it matches any of your database entries. In that case, we do use the norm of the difference between the two embeddings as the distance metric.
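As a rough sketch of that matching step, assuming a small in-memory database (the names, threshold, and layout here are illustrative, not the assignment's actual values):

```python
import numpy as np

def who_is_it(door_cam_embedding, database, threshold=0.7):
    # database: dict mapping employee name -> precomputed embedding (np.ndarray)
    best_name, best_dist = None, float("inf")
    for name, stored_embedding in database.items():
        # Distance metric: norm of the difference between the two embeddings
        dist = np.linalg.norm(door_cam_embedding - stored_embedding)
        if dist < best_dist:
            best_name, best_dist = name, dist
    # Only accept a match if the closest database entry is within the threshold
    if best_dist < threshold:
        return best_name, best_dist
    return None, best_dist
```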
This version of DLS was published in April 2021. A lot has happened in ML since then, but I have not been tracking the literature about Face Recognition to know if anyone has come up with more advanced techniques than what Professor Ng teaches us in this section. Maybe we get lucky and someone else can point us to more advanced techniques.
This is my understanding:
The last neuron of this model compares the similarity of two image vectors to determine whether they show the same person. In that case, could the model be made to output only the image vectors, and then a measure such as cosine similarity be used to calculate whether the two vectors are similar?
In that case, there is no (trainable) model at all for the comparison step, because there are only two parts: retrieve the image vector from the vector database and calculate the similarity.
In this sense, the slide isn't really relevant to your question, because the slide is about training a model that produces image vectors which can be compared to tell whether two photos show the same person.
To answer "yes" to your question, I think we first need to make sure that the retrieved vectors are good for this very specific purpose. Can we be sure they are? I think it takes experimentation to verify, but I could imagine that the context embedded in the image vectors, especially those produced by a multi-modal LLM, could be so rich that it goes well beyond identifying the person. That means that if I just apply cosine similarity, it may not be comparing only the person's identity.
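If you wanted to run that experiment, one simple check might look like the sketch below, where `embed` is a stand-in for whatever embedding model you are evaluating and the photo pairs are ones you label yourself:

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def identity_separation(embed, same_person_pairs, different_person_pairs):
    # `embed` maps an image to a vector; the pairs are (image, image) tuples.
    same = [cosine_similarity(embed(a), embed(b)) for a, b in same_person_pairs]
    diff = [cosine_similarity(embed(a), embed(b)) for a, b in different_person_pairs]
    # If the embeddings really track identity, same-person similarities should be
    # consistently higher than different-person similarities.
    return np.mean(same) - np.mean(diff)
```

A large positive gap would suggest the embeddings track identity well enough for cosine similarity to work; a small gap would confirm the worry that too much other context is mixed in.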