In the triplet loss formulation we only consider rows of the similarity matrix (to calculate the mean negative and the closest negative loss).

But the columns hold the same kind of information, right? Why not use the columns too when calculating these two losses?

I’m not sure what you are thinking:

- Are you suggesting to use them *instead* of rows? (Then why? What would be the advantage?)
- Or are you suggesting to use the columns *with* the rows? (Then also how (sum?) and why?)

These are cosine similarity scores. It does not matter if you get them as `(v1, v2)` or `(v2, v1)` (this would only flip the columns vs. rows).

Or did I not understand what you are asking?

I didn’t realize the similarity matrix was symmetric. How exactly is it calculated from the 2 batches, and why is it symmetric? (I thought, if S denotes the similarity matrix, S[i, j] would be the similarity of row i of batch 1 with row j of batch 2, and S[j, i] would be the similarity of row j of batch 1 with row i of batch 2. These are only the same on the diagonal.)

Ok, to make things concrete… Let’s say the batch size is 256 like in the Assignment. Then the data_generator returns a tuple of numpy arrays. For example, say the questions happen to have a `max_len` of 64; then the output from data_generator is a tuple (b1, b2), both of the same shape:

```
np.array(b1).shape
(256, 64)
np.array(b2).shape
(256, 64)
```
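A minimal sketch of how such fixed-shape batches could be built from token-id lists (`pad_batch` is a hypothetical helper for illustration, not the assignment’s actual `data_generator`):

```python
import numpy as np

def pad_batch(questions, max_len=64, pad_id=0):
    """Hypothetical helper: list of token-id lists -> (batch, max_len) int array."""
    out = np.full((len(questions), max_len), pad_id, dtype=np.int32)
    for i, q in enumerate(questions):
        # Copy up to max_len tokens; the rest stays as padding
        out[i, :min(len(q), max_len)] = q[:max_len]
    return out

b1 = pad_batch([[5, 7, 9], [3, 4]])
print(b1.shape)  # (2, 64)
```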

The Siamese model receives this tuple of inputs (b1, b2) and does the following for each strand:

- Embedding `d_model=128` → (256, 64, 128)
- LSTM → (256, 64, 128)
- tl.Mean(axis=1) → (256, 128)
- tl.Fn(‘Normalize’… → (256, 128)
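The Mean and Normalize steps can be sketched in plain numpy (the random array is just a stand-in for a real LSTM output, to show the shapes):

```python
import numpy as np

# Stand-in for one strand's LSTM output: (batch, seq_len, d_model).
# Real values would come from Embedding -> LSTM; random here only shows shapes.
lstm_out = np.random.rand(256, 64, 128)

v = lstm_out.mean(axis=1)                          # tl.Mean(axis=1) -> (256, 128)
v = v / np.linalg.norm(v, axis=1, keepdims=True)   # Normalize -> unit-length rows
print(v.shape)  # (256, 128)
```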

So the output of the model is a tuple of (v1, v2):

```
v1.shape
(256, 128)
v2.T.shape
(128, 256)
```

When you take the dot product of `v1` with `v2.T`, you get `scores` of shape (256, 256) - the similarities between each of the 256 questions. As far as I understand, this is where your initial question about columns comes in?

The diagonal values of `scores` are the “positives” (similarities between duplicate questions, where i = j); every other value is a “negative” (similarity between the row question and the other questions, where i != j). When we calculate `mean_negative`, we average the “negative” similarities - so the columns disappear, but we get what we want: the mean similarity to every other question that is not the duplicate.
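Here is a small numpy sketch of those steps (a tiny batch of 4 instead of 256; masking the diagonal with `-2.0` works because a cosine similarity is never below -1):

```python
import numpy as np

batch, d = 4, 8  # tiny sizes for illustration (the assignment uses 256, 128)
rng = np.random.default_rng(0)
v1 = rng.normal(size=(batch, d))
v2 = rng.normal(size=(batch, d))
# Normalize rows so dot products are cosine similarities
v1 /= np.linalg.norm(v1, axis=1, keepdims=True)
v2 /= np.linalg.norm(v2, axis=1, keepdims=True)

scores = v1 @ v2.T                       # (batch, batch) similarity matrix
positive = np.diag(scores)               # S[i, i]: duplicate similarities
eye = np.eye(batch, dtype=bool)
# Mean of each row's off-diagonal ("negative") entries
mean_negative = np.where(eye, 0.0, scores).sum(axis=1) / (batch - 1)
# Largest off-diagonal entry per row; -2 is below any cosine score
closest_negative = np.where(eye, -2.0, scores).max(axis=1)
```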

Let me reformulate.

I understand how the output of the model is calculated.

Say we feed (b1, b2) to the model and (v1, v2) are the outputs.

Each row of v1 (resp. v2) is an encoding of the question at the same row of b1 (resp. b2).

To compute the similarity matrix, we take the dot product of v1 and v2.T and as you said, the shape is (batch_size, batch_size). I will denote S the resulting similarity matrix.

S[i, j] is the similarity of question i in b1 and question j in b2.

S[j, i] is the similarity of question j in b1 and question i in b2.

For any y, b1[y] and b2[y] are duplicate questions, and one can expect that, if the model learns correctly, v1[y] and v2[y] will be similar. But that doesn’t mean exactly the same.

So S[y1, y2] = dot(v1[y1], v2[y2]) should, in essence, end up similar to S[y2, y1] = dot(v1[y2], v2[y1]), for any y1 != y2.
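A quick numpy check of that intuition - if the two encodings of each pair are close but not identical, S comes out nearly (but not exactly) symmetric:

```python
import numpy as np

rng = np.random.default_rng(1)
v = rng.normal(size=(4, 8))
v /= np.linalg.norm(v, axis=1, keepdims=True)
v1 = v
v2 = v + 0.01 * rng.normal(size=v.shape)  # duplicates encode similarly, not identically

S = v1 @ v2.T
print(np.abs(S - S.T).max())  # small, but not exactly zero
```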

So now back to the actual question. When we calculate the loss, we compare, for each row i, the positive element S[i, i] with the elements S[i, j], for every j != i. Meaning we compare how the similarity computed for the duplicate questions b1[i] and b2[i] fares against the non-duplicates b1[i] and b2[j]. (The reference question is b1[i] and we compare it with the non-duplicate questions of b2.)

What I was wondering is why we don’t take the columns into account too. Meaning, why don’t we compare how the similarity computed for the duplicates b1[i] and b2[i] fares against the non-duplicates b1[j] and b2[i], for every j != i. (This time the reference question is b2[i] and we compare it with the non-duplicate questions of b1.)

The loss I had in mind would look something like this:

```
Cost1 = max(-S[i,i] + (sum over j != i of S[i,j] + sum over j != i of S[j,i]) / (2 * batch_size - 2) + margin, 0)

Cost2 = max((-2 * S[i,i] + closest neg over j != i of S[i,j] + closest neg over j != i of S[j,i]) / 2 + margin, 0)

Cost = Cost1 + Cost2
```
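Under those definitions, a hedged numpy sketch of this symmetric loss (the function name and `margin` default are my own choices, not from the assignment):

```python
import numpy as np

def symmetric_triplet_loss(v1, v2, margin=0.25):
    """Sketch of the proposed loss using both rows and columns of S.
    Assumes v1, v2 are row-normalized; name and margin are hypothetical."""
    b = v1.shape[0]
    S = v1 @ v2.T                      # (b, b) similarity matrix
    pos = np.diag(S)                   # S[i, i]
    eye = np.eye(b, dtype=bool)
    off = np.where(eye, 0.0, S)        # off-diagonal entries, diagonal zeroed
    # Mean over both the row negatives S[i, j] and the column negatives S[j, i]
    mean_neg = (off.sum(axis=1) + off.sum(axis=0)) / (2 * b - 2)
    # Closest negative per row and per column (-2 is below any cosine score)
    masked = np.where(eye, -2.0, S)
    closest_row = masked.max(axis=1)   # max over j != i of S[i, j]
    closest_col = masked.max(axis=0)   # max over j != i of S[j, i]
    cost1 = np.maximum(-pos + mean_neg + margin, 0.0)
    cost2 = np.maximum((-2 * pos + closest_row + closest_col) / 2 + margin, 0.0)
    return (cost1 + cost2).mean()
```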

I understand your question better now - what you are essentially asking is: why do we only focus on b1 instead of treating both batches equally?

This is a good question and I can only offer my take on it - I think the course creators had in mind that the current `TripletLossFn` implementation is complicated enough, so complicating it further would be hard on learners. (It is already one of the hardest exercises in the course.)

But you’re right to ask - since we’ve done most of the work/computations and now have the similarity scores, **why not find the closest_negatives and mean_negatives for b2 too** (the questions on the right side, since they have the same meaning but are not exactly the same words) and then adjust (sum) the losses accordingly. This would have been more efficient.

As an alternative (but a less efficient loss calculation), we could have incorporated this behavior (randomly switching q_1 with q_2) in `data_generator` (so that `input1` would randomly be appended with q_1 **or** q_2 and not necessarily q_1; the other q_x would go to `input2`).
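A minimal sketch of that random swap (`swap_pairs` is a hypothetical helper that would be called inside `data_generator`):

```python
import random

def swap_pairs(q1_batch, q2_batch, rng=None):
    """Hypothetical helper: for each duplicate pair, randomly choose which
    question goes to input1 and which to input2."""
    rng = rng or random.Random()
    input1, input2 = [], []
    for q1, q2 in zip(q1_batch, q2_batch):
        if rng.random() < 0.5:
            q1, q2 = q2, q1  # swap roles of the pair
        input1.append(q1)
        input2.append(q2)
    return input1, input2
```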