Using BridgeTower from Hugging Face produces different results

selili · October 1, 2024, 12:52pm

I tried running the Lesson 2 code on my Mac laptop but got blocked because I don’t have a Prediction Guard API key.

As a workaround, I used BridgeTowerProcessor and BridgeTowerModel from the Hugging Face transformers library.

I refactored the bt_embedding_from_prediction_guard in utils.py as below:

def bt_embedding_from_prediction_guard(prompt, base64_image):
    # # get PredictionGuard client
    # client = _getPredictionGuardClient()
    # message = {"text": prompt,}
    # if base64_image is not None and base64_image != "":
    #     if not isBase64(base64_image): 
    #         raise TypeError("image input must be in base64 encoding!")
    #     message['image'] = base64_image
    # response = client.embeddings.create(
    #     model="bridgetower-large-itm-mlm-itc",
    #     input=[message]
    # )
    # return response['data'][0]['embedding']

    processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
    model = BridgeTowerModel.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")

    inputs = {"text": prompt}
    
    if base64_image:
        inputs["images"] = base64_image

    # Preprocess the inputs
    processed_inputs = processor(text=[inputs['text']], images=[inputs.get('images', None)], return_tensors="pt")

    # Generate the embedding
    with torch.no_grad():
        outputs = model(**processed_inputs)
    
    # Extract the embeddings (you can change which embedding layer to use depending on your task)
    embeddings = outputs.pooler_output

    return embeddings.tolist()  # Return the embeddings as a list for easier use


# load the image at the given path, or return it unchanged if it is already a PIL Image
def download_image(image_path_or_PIL_img):
    if isinstance(image_path_or_PIL_img, PIL.Image.Image):
        return image_path_or_PIL_img
    else:
        # this is an image path
        with open(image_path_or_PIL_img, "rb") as image_file:
            image_data = image_file.read()
            image = Image.open(BytesIO(image_data))
            return image

And in L2_Multimodal Embeddings.ipynb, I changed the Compute Embedding block as below:

embeddings = []
for img in [img1, img2, img3]:
    img_path = img['image_path']
    caption = img['caption']
    # base64_img = encode_image(img_path)
    img = download_image(img_path)
    embedding = bt_embeddings(caption, img)
    # # embeddings.append(embedding)
    embeddings += embedding

I got the code working, but the cosine similarity returns:

Cosine similarity between ex1_embeded and ex2_embeded is:
0.9268679323546363
Cosine similarity between ex1_embeded and ex3_embeded is:
0.8940821384304778

These cosine similarities are much higher than the results from the class:

Cosine similarity between ex1_embeded and ex2_embeded is:
0.48566270290489155
Cosine similarity between ex1_embeded and ex3_embeded is:
0.17133985252863604

What’s going wrong here?
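
For reference, here is a minimal sketch of the cosine similarity being compared above, assuming the embeddings are plain Python lists of floats. The name cosine_similarity and the length check are illustrative assumptions; the helper used in the course notebook may be implemented differently.

```python
import math


def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    # Cosine similarity is only defined for vectors of the same length,
    # so embeddings from two different models cannot be mixed.
    if len(vec1) != len(vec2):
        raise ValueError(f"length mismatch: {len(vec1)} vs {len(vec2)}")
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)
```

The score depends only on the direction of the two vectors, so embeddings taken from a different output layer of the model can give very different values for the same image and caption pairs.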

Deepti_Prasad · October 1, 2024, 4:39pm

hi @selili

Are the images you used with the Hugging Face transformers model the same as the ones used in the course lessons?

regards
DP

selili · October 2, 2024, 1:11am

Hi DP, yes, the images were exactly the same. I used exactly the same code snippets and only changed the parts mentioned in the post.

freesysck · October 6, 2024, 1:56pm

The embedding returned through _getPredictionGuardClient has 512 dimensions, whereas the output of the transformers BridgeTowerModel has 2048.

Deepti_Prasad · October 6, 2024, 2:03pm

hi @freesysck

I was suspecting the same. By any chance, did you also try this?

You can also share your output here, so it can help other learners too.

Thank you
DP

freesysck · October 6, 2024, 2:14pm

From https://learn.deeplearning.ai/courses/multimodal-rag-chat-with-videos/lesson/3/multimodal-embeddings you can see that the embedding returned by the Prediction Guard client has length 512.

If you run the code provided above in the loop, the result has length 2048.
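
One way to get a 512-dimensional embedding locally might be the contrastive head in transformers, BridgeTowerForContrastiveLearning, which returns text_embeds, image_embeds, and cross_embeds. The sketch below is only an assumption about what the hosted endpoint does: I have not confirmed that Prediction Guard returns cross_embeds, so the similarity values may still not match the course. The function names load_bridgetower_itc, bt_embedding_local, and embedding_dim are hypothetical.

```python
def load_bridgetower_itc(model_id="BridgeTower/bridgetower-large-itm-mlm-itc"):
    """Load the processor and contrastive model once (large download), then reuse them."""
    from transformers import BridgeTowerForContrastiveLearning, BridgeTowerProcessor

    processor = BridgeTowerProcessor.from_pretrained(model_id)
    model = BridgeTowerForContrastiveLearning.from_pretrained(model_id)
    model.eval()
    return processor, model


def bt_embedding_local(processor, model, caption, pil_image):
    """Return the joint text-image embedding for one caption and one PIL image."""
    import torch

    inputs = processor(images=pil_image, text=caption, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Assumption: cross_embeds is the counterpart of the hosted embedding.
    # Take row 0 because the batch holds a single caption-image pair.
    return outputs.cross_embeds[0].tolist()


def embedding_dim(embedding):
    """Length of a flat embedding, or of the first row of a nested one."""
    row = embedding[0] if embedding and isinstance(embedding[0], list) else embedding
    return len(row)
```

Usage would look like processor, model = load_bridgetower_itc(), followed by emb = bt_embedding_local(processor, model, caption, img) inside the loop. Checking embedding_dim(emb) before computing cosine similarity confirms whether the length is 512 or 2048.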

Deepti_Prasad · October 6, 2024, 2:17pm

Thank you @freesysck. I think that answers your question about why the cosine similarity differs, @selili.
