Hello. I am trying to build a small model inspired by what Professor Andrew Ng taught about FaceNet and triplet loss. However, it is not easy :((
I use CIFAR-10 and create triplets of images to feed into the model. I explain all the details in the attached Jupyter notebook.
However, after training, the model behaves quite strangely: it simply pushes all output vectors to zero, and the result is even worse than that of the initial model (before training). I can't figure out the reason. Could anybody kindly take a look at my notebook and enlighten me? Thank you very much!
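For context, the loss I am using is the standard triplet loss from the course. A minimal sketch of what I mean (the names here are illustrative, not my exact notebook code):

```python
import tensorflow as tf

def triplet_loss(y_true, y_pred, alpha=0.2):
    # y_pred is assumed to hold the stacked embeddings of the
    # anchor, positive and negative images: shape (batch, 3, 128).
    # y_true is ignored; it is only there because Keras expects it.
    anchor, positive, negative = y_pred[:, 0], y_pred[:, 1], y_pred[:, 2]
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    # Hinge loss with margin alpha, summed over the batch.
    return tf.reduce_sum(tf.maximum(pos_dist - neg_dist + alpha, 0.0))
```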
kim.ipynb (109.2 KB)
Hey @FreyMiggen,
Please help me understand some things better so that we can debug this together.
- In the code cell in which you are defining the function `create_triples`, what exactly is the role of `labels`? I can see that it's simply an array of 1s, and I can't see any samples with `label = 0`.
- Also, as far as I can recall, the labels were needed when we framed face verification/recognition as a binary classification problem, but it doesn't seem to me that you have adopted that approach.
- The next thing I am a little confused with is why you have compared the triplet loss before training on the first 10 samples and the triplet loss after training on the next 10 samples.
- Don’t you think that in order to check the model’s competency, we should compare the triplet losses before and after training on the same samples, be it first 10, next 10, or some other samples?
- I tested on the same samples (with some modifications in model training), as can be seen here, and it seems that the model is performing to some extent at least; a sketch of this kind of fixed-sample check is included after this list.
- Now, perhaps the most intriguing thing I would like to ask is what exactly you are trying to achieve with this model.
- The framework that we have learnt and that you are trying to modify is for face recognition and/or face verification.
- But you have used it on the CIFAR-10 dataset. Initially, I too thought that this framework could easily be used for classification, but let's take a step back and see how classification may differ from face verification.
- Let's say we have 2 classes, "table" and "chair". I take 1 sample from each class, where both samples have the same background, the same lighting conditions, and the same colour for the objects.
- In this case, it is highly likely that their embeddings (128-dimensional vectors) are quite close to each other; however, we are asking our face verification model to separate these embeddings.
- Also, let’s say that I take 2 samples from the class “chair”, where both the samples have completely different backgrounds, different lighting conditions, different colours, and perhaps even different shapes.
- In this case, it is likely for these samples to have quite distant embeddings, and we are asking our face verification model to bring these embeddings closer to each other.
- Don’t you think it would simply mess up our ConvNet?
- On the other hand, a good classification model may not attempt to shift the embeddings at all, but could simply look for a particular neuron to light up (or activate) to classify an object as a table or chair, for instance, a neuron associated with the area of an object.
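For reference, here is the kind of fixed-sample check I have in mind (just a sketch; I am assuming `embedding_net` is the network that maps a single image to its 128-dimensional embedding, and that `anchors`, `positives` and `negatives` are one fixed set of triplets, so these are not your exact notebook variables):

```python
import numpy as np

def loss_on_fixed_triplets(embedding_net, anchors, positives, negatives, alpha=0.2):
    # Embed each part of the fixed triplet set with the current weights.
    emb_a = embedding_net.predict(anchors, verbose=0)
    emb_p = embedding_net.predict(positives, verbose=0)
    emb_n = embedding_net.predict(negatives, verbose=0)
    pos_dist = np.sum(np.square(emb_a - emb_p), axis=-1)
    neg_dist = np.sum(np.square(emb_a - emb_n), axis=-1)
    # Summed triplet loss over this fixed set of samples.
    return np.sum(np.maximum(pos_dist - neg_dist + alpha, 0.0))

# Evaluate on the SAME slice before and after training:
# loss_before = loss_on_fixed_triplets(embedding_net, a[:10], p[:10], n[:10])
# model.fit(...)
# loss_after  = loss_on_fixed_triplets(embedding_net, a[:10], p[:10], n[:10])
```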
Let me know your opinion on these, and we can continue our discussion forward.
Cheers,
Elemento
Hi @Elemento, thank you for spending time on my question!
I would like to answer some of your questions:
- First, you're right, `labels` is just an array of 1s. Because, as far as I know, the triplet loss does not require `y_true`, I thought it doesn't matter what value I assign to `y_true`. As a result, I just passed an array of 1s for the sake of training on Keras. Please tell me that I am not wrong about it :((
- Second, I fidgeted with the set of 10 examples to test the triplet loss on and failed to change it back to the same set before loading it. I am really sorry if it confused you. However, I have just adjusted it to use the same example set, and you are right again: the result after training is improved a little. Please correct me if I am wrong, but a loss of 2 over 10 examples is approximately 0.2 per example, and with alpha set to 0.2, that means the model simply pushes all output vectors toward the zero vector and doesn't actually learn anything.
- Third, this is another issue that puzzles me. Why is the val_loss value only around 0.6, while when I test with `triplet_loss_test`, nearly every set of 10 examples yields a loss above 2?
About your question on what I would like to achieve by using CIFAR-10: I completely agree with what you said about the difference between classification and verification. However, my naive thought is that if a neural network can find the similarity between images with the same label and assign them a single label number, maybe I could train a neural network to detect that two images are similar while a third is wholly different. I may be terribly wrong, but by the logic you offered about classification, is it possible for the model to do the following: for images with the same label, activate the same neurons and shut down all the others? Would that be an efficient way to make two images with the same label closer to each other than to an image with a different label?
Again, thank you for your time. I am still a newbie so it is very likely that my intuition above is completely wrong.
Have a good day,
Frey
Hey @FreyMiggen,
Yes, you are correct; the triplet loss doesn't require `y_true`, and in order to eliminate the use of `y_true`, I suppose we might need to make some changes, since simply not passing it into the `fit` method will give an error. One way to do that is sketched below.
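With tf.keras (TF 2.x), you can attach the loss to the model with `add_loss`, so that `compile` needs no `loss` argument and `fit` needs no labels at all. A rough sketch of that pattern (the builder function and variable names here are made up, assuming the three images are separate inputs and `embedding_net` is the shared CNN; details will vary with your Keras version):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_triplet_model(embedding_net, input_shape=(32, 32, 3), alpha=0.2):
    # Three image inputs share one embedding network.
    anchor   = layers.Input(shape=input_shape, name="anchor")
    positive = layers.Input(shape=input_shape, name="positive")
    negative = layers.Input(shape=input_shape, name="negative")

    emb_a, emb_p, emb_n = embedding_net(anchor), embedding_net(positive), embedding_net(negative)

    pos_dist = tf.reduce_sum(tf.square(emb_a - emb_p), axis=-1)
    neg_dist = tf.reduce_sum(tf.square(emb_a - emb_n), axis=-1)
    loss = tf.reduce_mean(tf.maximum(pos_dist - neg_dist + alpha, 0.0))

    model = Model([anchor, positive, negative], [emb_a, emb_p, emb_n])
    model.add_loss(loss)             # loss attached directly, so no y_true is needed
    model.compile(optimizer="adam")  # and no `loss` argument either
    return model

# model = build_triplet_model(embedding_net)
# model.fit([anchors, positives, negatives], epochs=10)   # note: no labels passed
```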
Now, the fact that a loss of 2 over 10 examples is approximately equal to a loss of 0.2 over a single example may not be true. The only way to make sure of it is to monitor the loss corresponding to individual examples, since one example could also give a loss of 2, and others may not contribute to the loss at all.
However, the lines of code below do prove your statement:
print(np.sum(np.abs(result)))
>>> 0.1503207
print(np.sum(result == 0), 3*10*128)
>>> 3827 3840
For a particular config, out of 3840 elements, 3827 elements of `result` are 0, i.e., the model indeed pushes all the output vectors to 0.
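And this also explains the numbers you quoted: if every embedding collapses to the zero vector, both distances are 0 and each triplet contributes exactly `alpha` to the summed loss, so 10 triplets give a loss of about 2 when `alpha = 0.2`. A quick check (plain NumPy, just to illustrate):

```python
import numpy as np

alpha = 0.2
emb_a = emb_p = emb_n = np.zeros((10, 128))   # 10 fully collapsed triplets

pos_dist = np.sum(np.square(emb_a - emb_p), axis=-1)   # all zeros
neg_dist = np.sum(np.square(emb_a - emb_n), axis=-1)   # all zeros
loss = np.sum(np.maximum(pos_dist - neg_dist + alpha, 0.0))
print(loss)   # ≈ 2.0, i.e. alpha per triplet
```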
This is intriguing indeed. First of all, in order to compare the losses, the batch size should be equal to the number of samples on which we are individually evaluating the loss (both 64 in the notebook). After the 10th epoch, the val loss given by model training is about 0.2002, while the one given by the function is around 12.799; but if we divide the latter by the number of samples (or batch size), the loss given by the function comes out to around 0.1999, and the two values are now pretty similar.
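In other words, the mismatch is just a sum-versus-mean convention: Keras reports the loss averaged over the batch, whereas a function that sums the per-triplet losses (as `triplet_loss_test` appears to) reports the total over the batch. So, to compare the two numbers:

```python
batch_size = 64
summed_loss = 12.799                 # value returned by triplet_loss_test on one batch
print(summed_loss / batch_size)      # ≈ 0.1999, close to the 0.2002 Keras reports
```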
As for your idea about classification: first of all, it is not wrong at all, and in fact we have seen this happening: we train FaceNet to do face verification and we train, say, AlexNet for image classification, so it can definitely be done.
But the important thing here is the dataset. Compare two cases: in case 1, there are 2 images of the same person, and in case 2, there are 2 images of the same class, say "chair". In your opinion, in which case would the images be more similar? In my previous reply, I highlighted how 2 images belonging to the same class could have very different feature vectors (case 2), but even after all the changes in background, lighting conditions, etc., how much do you think the images of a single person will change?
In my previous reply, when I said that an image classification model could activate the same neurons for objects of the same class, I wasn't referring to "shutting down all the other neurons". Let's say that the neurons represent colour, area, height, width, etc. Now, let's assume that, based on the neuron representing area, the model classified 2 samples as belonging to the "chair" class. In this case, do you think it is possible for the model to shut down the neurons corresponding to height, width, colour, etc.?
Here’s the new version associated with the above explanation. I hope this resolves your query. Let me know if you need any further help.
Cheers,
Elemento
Hi @Elemento, you were right. The biggest problem was the dataset I used. I have switched to Fashion-MNIST, and with a little modification to my code, everything works perfectly well now!
Thank you for supporting me!
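In case it helps anyone who finds this thread later, the triplet sampling I ended up with looks roughly like this (a simplified sketch, not the exact notebook code):

```python
import numpy as np
from tensorflow.keras.datasets import fashion_mnist

def create_triplets(x, y, num_triplets=5000, seed=0):
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    # Group image indices by class label.
    idx_by_class = {c: np.where(y == c)[0] for c in classes}
    anchors, positives, negatives = [], [], []
    for _ in range(num_triplets):
        # Anchor and positive share a class; the negative comes from another class.
        pos_class, neg_class = rng.choice(classes, size=2, replace=False)
        a, p = rng.choice(idx_by_class[pos_class], size=2, replace=False)
        n = rng.choice(idx_by_class[neg_class])
        anchors.append(x[a]); positives.append(x[p]); negatives.append(x[n])
    return np.array(anchors), np.array(positives), np.array(negatives)

(x_train, y_train), _ = fashion_mnist.load_data()
x_train = x_train[..., np.newaxis] / 255.0   # (60000, 28, 28, 1), scaled to [0, 1]
anchors, positives, negatives = create_triplets(x_train, y_train)
```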
Hey @FreyMiggen,
Thanks a lot for informing us. I do believe that the samples of any class in Fashion-MNIST are much closer to each other in shape than the samples of any particular class in CIFAR-10. For instance, a "coat" doesn't come in as many different shapes as a "table" can. By the same argument, I suppose your code will also work on the MNIST dataset.
I am glad I could help.
Cheers,
Elemento