Using BridgeTower from Hugging Face produces different results

I tried running the Lesson 2 code on my Mac laptop, but I got blocked because I don't have a Prediction Guard API key.

As a workaround, I used the Hugging Face Transformers classes BridgeTowerProcessor and BridgeTowerModel.

I refactored bt_embedding_from_prediction_guard in utils.py as below:

import torch
from transformers import BridgeTowerProcessor, BridgeTowerModel


def bt_embedding_from_prediction_guard(prompt, base64_image):
    # # get PredictionGuard client
    # client = _getPredictionGuardClient()
    # message = {"text": prompt,}
    # if base64_image is not None and base64_image != "":
    #     if not isBase64(base64_image):
    #         raise TypeError("image input must be in base64 encoding!")
    #     message['image'] = base64_image
    # response = client.embeddings.create(
    #     model="bridgetower-large-itm-mlm-itc",
    #     input=[message]
    # )
    # return response['data'][0]['embedding']

    # Note: loading the processor and model on every call is slow; in practice
    # you would load them once at module level and reuse them.
    processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
    model = BridgeTowerModel.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")

    # Despite the parameter name, the notebook below now passes a PIL image here,
    # not a base64 string, because BridgeTowerProcessor expects PIL images.
    if not base64_image:
        raise ValueError("BridgeTower needs an image alongside the text prompt")

    # Preprocess the text/image pair
    processed_inputs = processor(text=[prompt], images=[base64_image], return_tensors="pt")

    # Generate the embedding
    with torch.no_grad():
        outputs = model(**processed_inputs)

    # Extract the embedding; pooler_output is the concatenation of the pooled text
    # and image representations, so its size is 2 x hidden_size (2048 for the large model)
    embeddings = outputs.pooler_output

    return embeddings.tolist()  # nested list with a leading batch dimension of 1


from io import BytesIO

import PIL
from PIL import Image


# load the image at the given path, or return it unchanged if it is already a PIL Image
def download_image(image_path_or_PIL_img):
    if isinstance(image_path_or_PIL_img, PIL.Image.Image):
        return image_path_or_PIL_img
    else:
        # this is an image path: read the bytes and open them as a PIL Image
        with open(image_path_or_PIL_img, "rb") as image_file:
            image_data = image_file.read()
            image = Image.open(BytesIO(image_data))
            return image

And in L2_Multimodal Embeddings.ipynb, I changed the Compute Embedding block as below:

embeddings = []
for img in [img1, img2, img3]:
    img_path = img['image_path']
    caption = img['caption']
    # base64_img = encode_image(img_path)
    pil_img = download_image(img_path)
    embedding = bt_embeddings(caption, pil_img)
    # embeddings.append(embedding)
    # bt_embeddings returns a nested list with a batch dimension of 1,
    # so += unwraps the batch and appends a single flat embedding
    embeddings += embedding
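
For reference, the similarity numbers below come from comparing pairs of these embeddings with a standard cosine-similarity helper; here is a minimal sketch (an assumption, not necessarily the course's exact implementation, and the ex*_embeded names simply mirror the printout below):

import numpy as np

def cosine_similarity(vec1, vec2):
    # dot product of the two vectors divided by the product of their norms
    v1, v2 = np.asarray(vec1, dtype=float), np.asarray(vec2, dtype=float)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

ex1_embeded, ex2_embeded, ex3_embeded = embeddings
print("Cosine similarity between ex1_embeded and ex2_embeded is:")
print(cosine_similarity(ex1_embeded, ex2_embeded))
print("Cosine similarity between ex1_embeded and ex3_embeded is:")
print(cosine_similarity(ex1_embeded, ex3_embeded))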

I got the code working, but the cosine similarity computation returns:

Cosine similarity between ex1_embeded and ex2_embeded is:
0.9268679323546363
Cosine similarity between ex1_embeded and ex3_embeded is:
0.8940821384304778

These cosine similarities are much higher than the results from the class:

Cosine similarity between ex1_embeded and ex2_embeded is:
0.48566270290489155
Cosine similarity between ex1_embeded and ex3_embeded is:
0.17133985252863604

What is going wrong here?

Hi @selili,

Are the images you used with the Hugging Face transformer the same ones used in the course lessons?

Regards,
DP

Hi DP, yes, the images were exactly the same. I used the same code snippets and only changed the parts mentioned in the post.

The embedding returned via _getPredictionGuardClient has 512 dimensions, which is different from the 2048-dimensional pooler_output of the Transformers BridgeTowerModel.
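
A quick way to confirm this is to print the length of one embedding from each path; a minimal sketch (pg_embedding here is a hypothetical variable holding an embedding returned by the original Prediction Guard call, and embeddings is the list built by the loop above):

hf_embedding = embeddings[0]   # one embedding from the Transformers-based loop
print(len(hf_embedding))       # 2048: BridgeTowerModel pooler_output size
# pg_embedding would come from the original Prediction Guard call
# print(len(pg_embedding))     # 512 according to the lesson notebook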


Hi @freesysck,

I was suspecting the same. By any chance, did you also try this?

You two can share your outputs here, so it can help other learners too.

Thank you,
DP

From https://learn.deeplearning.ai/courses/multimodal-rag-chat-with-videos/lesson/3/multimodal-embeddings you can check that the result from _getPredictionGuardClient has 512 dimensions.

[screenshot: embedding dimension is 512]

Running the code provided in the loop, the result has 2048 dimensions.

[screenshot: embedding dimension is 2048]
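
If you want embeddings that live in a shared contrastive space, which is probably closer in spirit to what the Prediction Guard endpoint returns, one option to experiment with is BridgeTowerForContrastiveLearning from Transformers. This is only a sketch (the image path is a placeholder, and whether the projected dimension matches Prediction Guard's 512 is something to verify):

import torch
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning

processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")

image = Image.open("some_image.jpg")   # placeholder path
inputs = processor(text=["a caption"], images=[image], return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# projected embeddings from the contrastive (ITC) heads
print(outputs.text_embeds.shape)
print(outputs.image_embeds.shape)
print(outputs.cross_embeds.shape)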


Thank you @freesysck, and I think that answers your query about why the cosine similarity differs, @selili.