In week 4, the similarity between vectors at the output of the Siamese network was defined as s(v_1, v_2) = cos(v_1, v_2). In addition, it was stated that the similarity with the positive example should be trained to be 1, and the similarity with the negative example should be trained to be -1.
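For concreteness, here is a minimal sketch of that similarity in plain NumPy (my own illustration, not the course's implementation):

```python
import numpy as np

def s(v1, v2):
    # Cosine similarity: the inner product of the L2-normalized vectors.
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
```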
Given these definitions: if we take all possible triplets from the database, doesn't this converge all of the output vectors of this model onto the same axis? Meaning there would be only two possible vectors at the outputs of this model, and they would be the negative versions of each other.
This doesn't make sense, as it divides the world of possible sentences into two groups: all the sentences in the first group share one meaning, which is different from the meaning shared by all the sentences in the second group.
Something is not right here. How is this logical?
Thanks
Roee
Hi Roee,
I am not sure I fully understand your question, but maybe this clarifies:
The Siamese network is used to determine similarity and dissimilarity. In theory this is a single axis. So ideally, similar vectors should point in the exact same (positive) direction, while dissimilar vectors should point in the exact opposite (negative) direction along the same axis. It is not about determining meaning in the output; it's only about similarity and dissimilarity.
I'll try to explain my question with the following example:
Suppose we have 6 sentences s_1, s_2, s_3, s_4, s_5, s_6 such that s_1 and s_2 are similar, s_3 and s_4 are similar, and s_5 and s_6 are similar. Suppose the Siamese network component is f(x) such that f(s_i) = v_i.
If the model has converged to its optimal solution, then we have v_i \cdot v_j^T = 1 for every similar (consecutive) pair and v_i \cdot v_j^T = -1 for all the others. Because v_1 \cdot v_2^T = 1 and all vectors are normalized, it is safe to say that v_1 = v_2 (the equality case of the Cauchy-Schwarz inequality). By the same reasoning, v_i = -v_j for all non-consecutive vectors.
So in the optimal state we have v_1 = -v_3 and v_3 = -v_5, therefore v_1 = v_5. But v_1 and v_5 are not consecutive and therefore should be the negatives of each other, not equal. So this state of convergence is mathematically impossible, and I suspect it is unstable for such systems: the optimizer will keep flipping the vectors' values.
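A quick numeric check of this contradiction (a minimal NumPy sketch; the particular unit vector is arbitrary):

```python
import numpy as np

def cos_sim(a, b):
    # Cosine similarity of two vectors.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v1 = np.array([0.6, 0.8])  # an arbitrary unit vector
v3 = -v1                   # forced by the demand cos(v1, v3) = -1
v5 = -v3                   # forced by the demand cos(v3, v5) = -1

print(cos_sim(v1, v3))  # -1.0, as demanded
print(cos_sim(v3, v5))  # -1.0, as demanded
print(cos_sim(v1, v5))  #  1.0, but the objective demands -1: contradiction
```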
Of course, this method eventually works, so I think something is helping it.
Following the content of the course, I think two things help in this case:
- The operations on the batch somewhat reduce this effect (having the contradicting constraints appear together in a batch may reduce the instability).
- The fact that we are not using this similarity measure as is, but applying a ReLU with an offset (margin) that is less than 1 to it, so we don't really demand convergence to the extreme points (see the loss sketch below).
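For reference, a minimal sketch of such a margin-based triplet loss on cosine similarities (the margin value of 0.25 is my assumption for illustration, not necessarily the course's exact setting):

```python
import numpy as np

def triplet_loss(v_a, v_p, v_n, margin=0.25):
    # Hinge-style triplet loss on cosine similarities, for L2-normalized
    # anchor (v_a), positive (v_p), and negative (v_n) vectors.
    s_pos = np.dot(v_a, v_p)  # cosine similarity; vectors already normalized
    s_neg = np.dot(v_a, v_n)
    # The loss is zero as soon as s_pos exceeds s_neg by `margin`,
    # so it never pushes the similarities all the way to +1 and -1.
    return max(0.0, margin - s_pos + s_neg)
```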
Hi again Roee,
Now I get your question.
It seems to me that if the network has been trained sufficiently, then v1 being found dissimilar to v3 and v3 dissimilar to v5 does not imply that the network will find v1 to be similar to v5. If v5 has some similarity to v1, the network should have been trained to distinguish between the two vectors.
You mention the operations on the batch and the ReLU function as contributing to this discriminatory functionality. The dimensionality of the model would seem to be another factor, with the final single dimension of similarity/dissimilarity being an artificial projection onto a single axis. So before projection onto this single axis, the full-dimensional representation of s1 is dissimilar to that of s3 as well as to that of s5, resulting in v1 being dissimilar to both v3 and v5, even if v3 is also dissimilar to v5 (due to the full-dimensional representation of s3 being dissimilar to that of s5).
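A concrete illustration of the dimensionality point (my own example, not from the course): even in just 2D, three unit vectors at 120° to one another are pairwise dissimilar with cosine similarity -0.5, so no pair has to reach the mutually impossible -1:

```python
import numpy as np

# Three unit vectors at 120 degrees to one another in 2D.
angles = np.deg2rad([0.0, 120.0, 240.0])
vecs = np.stack([np.cos(angles), np.sin(angles)], axis=1)

for i, j in [(0, 1), (1, 2), (0, 2)]:
    print(i, j, np.dot(vecs[i], vecs[j]))  # each pair: -0.5
```

With a margin-based loss, all three pairs can count as "dissimilar enough" simultaneously.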
Hi @reinoudbosch ,
Thank you for your answers!
I understand your intuition behind the dimensionality aspect, but this is not an issue of projection onto another axis. When the inner product of two normalized vectors equals 1 (i.e., the angle between them is zero), it means they are the same vector in all of their dimensions (not just in a projection).
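To spell that out (standard reasoning, not specific to the course): for unit vectors u and v,

$$u \cdot v = \|u\|\,\|v\|\cos\theta = \cos\theta, \qquad \text{so} \quad u \cdot v = 1 \iff \theta = 0 \iff u = v.$$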
In any case, I think I now understand the idea behind this similarity concept.