Although the triplet loss function is mentioned prominently in the exercise, the Inception-ResNet-v1 model is not trained with it, but with a softmax loss function, as in standard classification problems (Classifier training of inception resnet v1 · davidsandberg/facenet Wiki · GitHub).
I’m confused because of the similarity function and the one-shot learning concepts seen in the videos.
What kind of pre-trained model are we using? Why does it work? How can we train a NN to apply the FaceNet or DeepFace models? Why an Inception architecture?
Hi dmunera3,
The Inception network discussed in week 2 serves to classify images. The Residual_Networks assignment of week 2 uses a ResNet-50 model with a softmax activation to classify images; this is a model closely related to the Inception model, as you can see here, and its implementation is presented in that assignment.
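To make the contrast concrete, here is a minimal sketch (not the assignment code) of that kind of classification setup: a ResNet-50 ending in a softmax layer, trained with a categorical cross-entropy (softmax) loss. The class count of 1000 is just the ImageNet default.

```python
import tensorflow as tf

# A ResNet-50 classifier: the final layer is Dense(1000) with a
# softmax activation, so the model outputs class probabilities.
model = tf.keras.applications.ResNet50(weights=None, classes=1000)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```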
The FaceNet system discussed in week 4 serves to recognize faces based on a mapping of images to embeddings. This is discussed in this paper. It uses a triplet loss, which serves to distinguish similarities and differences between faces in images (which is why it works). The implementation of this system is demonstrated in the week 4 assignment.
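For reference, the triplet loss from the FaceNet paper, for an embedding f(·) of anchor, positive, and negative images, is:

$$\sum_{i=1}^{N} \Big[ \big\| f(x_i^a) - f(x_i^p) \big\|_2^2 \;-\; \big\| f(x_i^a) - f(x_i^n) \big\|_2^2 \;+\; \alpha \Big]_+$$

where $\alpha$ is the margin and $[\cdot]_+$ denotes $\max(\cdot, 0)$.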
I hope this clarifies things.
Hi reinoudbosch,
In the week 4 assignment, “Face Recognition”, section 3 says: “The network architecture follows the Inception model from Szegedy et al. An Inception network implementation has been provided for you, and you can find it in the file inception_blocks_v2.py to get a closer look at how it is implemented.”
When I explore the FaceNet GitHub repository where the used model (“keras-facenet-h5/model.json”) comes from, I find that they use an Inception-ResNet architecture (useful for image classification and not for face recognition) with a softmax loss function.
If we print a summary of the loaded Keras model (“keras-facenet-h5/model.json”), it doesn’t look like an architecture trained with the triplet loss function, because in that case we’d expect layers that combine three copies of the same NN (a Siamese network).
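For anyone who wants to reproduce this inspection, here is a sketch of loading the model and printing its summary. The weights filename ('keras-facenet-h5/model.h5') is my assumption about the companion file; adjust the paths to your environment.

```python
from tensorflow.keras.models import model_from_json

# Rebuild the architecture from the JSON description ...
with open("keras-facenet-h5/model.json", "r") as f:
    model = model_from_json(f.read())
# ... and load the pretrained weights (assumed filename).
model.load_weights("keras-facenet-h5/model.h5")

model.summary()  # single-branch graph ending in a 128-d embedding
```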
Thanks for the reply.
Hi dmunera3,
In the week 4 assignment, the Inception network is only used to create the encodings to be used by FaceNet. It is not used to classify images, and therefore its last layer outputs an encoding vector of size 128 rather than a softmax.
This is why the function triplet_loss is needed for the FaceNet system being built: it calculates the triplet loss between the encodings.
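In TensorFlow, a triplet loss over the three 128-d encodings looks roughly like the sketch below (along the lines of the triplet_loss function built in the assignment; the margin value of 0.2 is an assumption):

```python
import tensorflow as tf

def triplet_loss(y_true, y_pred, alpha=0.2):
    # y_pred holds the anchor, positive, and negative encodings.
    anchor, positive, negative = y_pred[0], y_pred[1], y_pred[2]
    # Squared L2 distances anchor<->positive and anchor<->negative.
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    # Hinge: anchor must be closer to positive than to negative by alpha.
    return tf.reduce_sum(tf.maximum(pos_dist - neg_dist + alpha, 0.0))
```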
Your confusion may be due to the file ‘keras-facenet-h5/model.json’ coming from the local environment rather than the GitHub repository. You can find this local JSON file if you go to File → Open.
Hi reinoudbosch,
Thanks for your quick replies.
I think my confusion is also due to the backpropagation process not being covered in the lectures and the assignment. My question is:
If I have an Inception network that creates the encodings for three images (A, P, N) and I calculate the triplet loss function, how can we optimize the weights of just one single model? Backpropagation through a CNN requires the activations of each layer. Which activations should I use?
Hi dmunera3,
Good point. It would indeed have been useful to discuss the backpropagation process.
The training process is discussed in the original paper. Put in a simplified way, it consists of propagating multiple inputs through the network (x_a^(i), x_p^(i), and x_n^(i)), then calculating the triplet loss, and then performing backprop. In fact, this is in line with what the Inception network was built for, i.e. distinguishing classes - in this case distinguishing different faces. So the adjustment of parameters is straightforward and aims to support the extraction of features that show differences or similarities.
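To answer the activations question concretely, here is a minimal sketch of one training step, assuming `model` is the single embedding network and `triplet_loss` is the function sketched above (both names are assumptions for illustration). All three images go through the same weights, and the gradient tape records the activations of all three forward passes, so backprop accumulates one set of gradients for the one model:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()

def train_step(anchor_imgs, positive_imgs, negative_imgs):
    with tf.GradientTape() as tape:
        # Three forward passes through the SAME network (shared weights).
        f_a = model(anchor_imgs, training=True)
        f_p = model(positive_imgs, training=True)
        f_n = model(negative_imgs, training=True)
        loss = triplet_loss(None, tf.stack([f_a, f_p, f_n]))
    # The tape holds the activations from all three passes, so the
    # gradients from each pass are summed into the shared weights.
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```

So there is no need to pick one set of activations: the same network is applied three times, and the framework keeps all three sets of activations for backprop automatically.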