Question on Contrastive loss and Embedding

Hi everyone,

My question regards the contrastive loss and what the embedding space represents.

In the “Applied Similarity Learning: Signatures & Satellites” lab, we learned about the Triplet and Contrastive losses. As I understand it, in the Satellites case, the model learned the concept of change, and we determine whether a change is Positive or Negative with a heuristic HSV color analysis (a 5% increase or decrease in greenery area).

In this particular case, why don’t we calculate this Positive or Negative change using the embeddings directly?

Let E be the embedding model, let E(b) and E(a) be the embedding representations of the “Before” and “After” images, respectively, and let D be a distance metric. Why don’t we calculate the change as:

Positive if D(E(a), E(b)) > 0

Negative if D(E(a), E(b)) < 0

Of course, this would be possible only if the distance D(A, B) is not equal to D(B, A); in other words, only if contrastive loss does not yield a symmetric space.

If that is the case, how do we enforce this concept of “direction of change” in contrastive loss? Do I have to represent a “delta” embedding vector, create a new embedding model to represent this “sign,” or calculate a “direction” vector?

Thank you all.

Hi @Esdras_Souto_Costa,

In the “Satellites” case study, the model uses a self-supervised learning approach built on the BYOL (Bootstrap Your Own Latent) framework. It learned the concept of change by predicting the change between two temporally sequential satellite images, with positive or negative changes determined by a heuristic HSV color analysis.

The model’s task was effectively binary or categorical classification of change using this heuristic label, which is why standard embedding and contrastive loss methods were not directly applicable.

Why Can’t Embedding and Contrastive Loss Be Applied Directly?

  1. Heuristic-Based Labelling vs Feature Similarity: Contrastive loss functions are fundamentally designed to pull representations of similar data points (positive pairs) closer together in an embedding space while pushing representations of dissimilar points (negative pairs) apart. “Similar” in typical contrastive learning is usually defined by augmentations of the same image (instance discrimination) or by known class labels. In the “Satellites” case, similarity was not inherent in the visual features in a way that contrastive loss could easily leverage for general representation learning. Instead, similarity was defined by a specific, external heuristic: the 5% change threshold in greenery area (via HSV analysis). The model wasn’t learning general visual similarity but a narrow definition of “change” dictated by this predefined rule.
  2. Pseudo-Labels, Not Discovered Similarity: The model used the HSV analysis to create “pseudo-labels” (positive or negative change) rather than discovering underlying feature relationships through contrastive comparison. It was trained to predict these pseudo-labels, making this a supervised classification problem, not an unsupervised representation learning problem in the traditional contrastive sense. The loss function used was likely a classification loss (such as binary cross-entropy) suitable for predicting the heuristic-derived change label.
  3. Lack of Suitable “Positive” Pairs: For a contrastive loss to work effectively for general representation learning, you need a reliable way to define what constitutes a “positive pair” (two things that are similar) and a “negative pair” (two things that are not). In the satellite imagery case, simply using two images that both carry a “positive change” label doesn’t guarantee they are visually similar in a way that would be useful for a broad embedding space; they might have vastly different visual characteristics yet both meet the arbitrary 5% greenery threshold. The model needed to distinguish between “change” and “no change” based on a specific metric, not learn a general-purpose embedding space where visual similarity equates to class membership.
  4. BYOL Avoids Contrastive Loss Altogether: The BYOL approach used in the study avoids direct contrastive loss. It uses a “momentum encoder” to provide a target representation, training the online network to predict the target network’s output. This architecture is specifically designed to work without negative examples, making it suitable for situations where defining negative pairs is difficult or ambiguous.
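To make the heuristic concrete, here is a rough sketch of how such an HSV greenery check could look. The hue band and the way the 5% threshold is applied to the green-pixel fraction are assumptions for illustration; the lab's actual implementation may differ.

```python
import numpy as np

def greenery_fraction(hsv):
    """Fraction of pixels whose hue falls in a rough 'green' band.

    `hsv` is an (H, W, 3) array with hue scaled to [0, 1]. The band
    below is an assumed stand-in, not the lab's exact thresholds.
    """
    hue = hsv[..., 0]
    return float(((hue > 0.20) & (hue < 0.45)).mean())

def change_pseudo_label(hsv_before, hsv_after, threshold=0.05):
    """Heuristic pseudo-label from the shift in greenery area."""
    delta = greenery_fraction(hsv_after) - greenery_fraction(hsv_before)
    if delta > threshold:
        return "positive"   # noticeably greener
    if delta < -threshold:
        return "negative"   # noticeably less green
    return "no change"
```

The point is that the label comes entirely from this external rule, not from anything the embedding space discovered on its own.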

Overall, the model was tailored to solve a narrow classification task defined by a heuristic. Standard embedding and contrastive loss methods would require a more robust definition of visual similarity to create meaningful positive/negative pairs and therefore would be inappropriate for this specific, pseudo-labeled, and heuristic-driven task.

Hope this helps!

Regards

DP


Hi @Esdras_Souto_Costa,

To address your question about the design choices and exactly what the embedding space represents, here is the reasoning behind why the notebook is set up this way:

  1. A Teaching Moment: The use of WeightedContrastiveLoss here is deliberate to showcase a different tool. In the first part of the lab, the notebook used TripletMarginLoss. While this satellite dataset could have been formatted for Triplet Loss, using Contrastive Loss demonstrates a completely different, yet valid, approach to training Siamese networks. It highlights that there is more than one way to solve these problems.

  2. The Nature of the Embedding Space: You asked what the embedding space represents. In this design, the embedding space represents a manifold of visual similarity, not a semantic map of “growth” or “decay.” The loss function uses Euclidean distance (the L2 norm), which is symmetric. In this space, the relative distance between E(a) and E(b) encodes the magnitude of the change (how much they differ), but not the direction of the change. It acts like a ruler, not a compass: it tells you that the images are far apart, but not why.

  3. The “Two-Step” Solution: Because the loss function forces the model to focus on the binary concept of “Change vs. No Change”, you need a second step to recover the “direction” (Positive vs. Negative).

    • Option A (Used in Notebook): Use the Siamese model to detect the change using an optimal threshold, then use a heuristic (like the HSV color check) to categorize it.
    • Option B (Alternative): You could train a second, separate classifier that only looks at the pairs the Siamese network flagged as “Changed” and classifies them as Positive or Negative.
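A rough sketch of how Option B could be wired together (the encoder, classifier, and threshold here are hypothetical stand-ins, not the notebook's code):

```python
import numpy as np

def detect_and_classify(embed, sign_classifier, img_before, img_after,
                        threshold):
    """Two-step change pipeline (Option B sketch).

    `embed` maps an image to an embedding vector, and `sign_classifier`
    maps a (before, after) pair to True for a Positive change -- both
    are hypothetical stand-ins for trained models.
    """
    # Step 1: the symmetric L2 distance says *whether* something changed...
    distance = np.linalg.norm(embed(img_after) - embed(img_before))
    if distance <= threshold:
        return "no change"
    # Step 2: ...and a second model recovers the *direction* of the change.
    return "positive" if sign_classifier(img_before, img_after) else "negative"
```

With dummy stand-ins, e.g. `embed = lambda x: np.asarray(x, dtype=float)` and `sign_classifier = lambda b, a: sum(a) > sum(b)`, the pipeline returns `"no change"` for identical inputs and a direction only once the distance clears the threshold.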

Best,
Mubsi


Thank you @Mubsi and @Deepti_Prasad for the explanation.

I understand that this lab utilizes a simple contrastive loss approach for pedagogical purposes.

While the BYOL paper relies heavily on various data augmentations (views) of the same image, this lab does not appear to use the same extent of augmentation, though it is clearly inspired by the methodology.

Indeed, the loss function employs the L2 distance; therefore the model yields a symmetric space, and consequently a directional distinction between ‘Positive’ and ‘Negative’ is not applicable here.
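This is easy to check numerically with arbitrary stand-in vectors for E(a) and E(b):

```python
import numpy as np

# Arbitrary stand-ins for E(b) and E(a); any vectors illustrate the point.
e_before = np.array([0.2, 0.7, -0.1])
e_after = np.array([0.9, -0.3, 0.4])

d_ab = np.linalg.norm(e_after - e_before)   # D(E(a), E(b))
d_ba = np.linalg.norm(e_before - e_after)   # D(E(b), E(a))

# The L2 distance is symmetric and non-negative, so the sign test
# "Positive if D > 0, Negative if D < 0" can never distinguish anything.
print(d_ab == d_ba, d_ab >= 0)   # True True
```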

Regarding my question on how to proceed, @Mubsi correctly suggested Option B.

I would like to discuss a potential extension: Is it possible to design a loss function that enforces specific vector arithmetic, similar to GloVe embeddings (e.g., king − man + woman ≈ queen), or that encodes the positive/negative distinction in vector magnitude, without altering the underlying model architecture or adding a classification head?

My intuition is that to incorporate a concept of ‘Greenery Density,’ we need something inspired by the ‘MagFace’ paper. MagFace analyzes feature vectors based on both direction and magnitude, where magnitude indicates face quality. Similarly, in our case, vector magnitude could serve as a proxy for ‘Greenery Density.’ This would require modifying the loss function rather than adding a separate regression or classification model.
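Roughly, what I have in mind is something like this simplified sketch (not MagFace’s actual loss; the `lam` weight and the way density enters the objective are my own assumptions):

```python
import numpy as np

def magnitude_aware_loss(e_a, e_b, similar, density_a, density_b,
                         margin=1.0, lam=0.1):
    """A contrastive term on distance, plus a term tying each embedding's
    L2 magnitude to its image's greenery density (MagFace-inspired sketch;
    every name and weight here is illustrative, not the lab's code)."""
    d = np.linalg.norm(np.asarray(e_a) - np.asarray(e_b))
    if similar:                               # similar pair: pull together
        contrastive = d ** 2
    else:                                     # dissimilar: push past the margin
        contrastive = max(0.0, margin - d) ** 2
    # Magnitude term: ||e|| should track greenery density, so the "amount
    # of green" is readable directly off the vector's length.
    magnitude = ((np.linalg.norm(e_a) - density_a) ** 2
                 + (np.linalg.norm(e_b) - density_b) ** 2)
    return contrastive + lam * magnitude
```

The direction of a change between two images would then show up as the sign of ‖E(a)‖ − ‖E(b)‖, with no separate classification head.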

It is great to see you digging deeper into this! Your intuition is spot on.

Option B was simply one standard alternative, and your proposed extensions are all implementable. You could absolutely design a custom loss function to enforce specific geometric constraints, such as using vector magnitude to represent “Greenery Density” (similar to the MagFace approach) or enforcing specific vector arithmetic to capture directionality directly within the embedding space. This approach would effectively bypass the need for a secondary classification head, so I definitely encourage you to experiment with defining those custom loss terms.
