Storing Embeddings

In the Understanding and Applying Text Embeddings course, the embeddings for the existing questions are stored in a pickle file, then loaded and finally put into a dataframe.
My question is: when working in a production environment, what would be the ideal approach to storing embeddings? Should I maintain a dataframe (CSV file) as suggested in the course? (I doubt it.)
I am downloading the incidents from our portal and currently have 5000 rows of closed incidents; ~10-30 new (open) incidents will come in daily, and I need to suggest the closed incidents that are similar to each new incident.

Hi @rakeshkuwar09, I haven't worked with embeddings in a production environment, but in general you want to store everything in a database so everyone has access to the same information.

I hope this helps


Hi @rakeshkuwar09

Infrastructure-wise: after all, your system architecture also depends on your exact requirements, e.g. with respect to:

  • business needs, e.g. whether or not your application needs to be real-time capable etc.
  • other architectural building blocks you employ, e.g. I am not sure whether you are calculating the embeddings on your own or using a standard API, like this one.

But it sounds like you are looking for a more scalable solution.

Have you considered vector databases?

Hope that helps!

Best regards
Christian

Hi @Christian_Simonis,

Thank you for your response. Allow me to rephrase my questions and provide additional details:

  1. I am utilizing the GCP model textembedding-gecko@001 to generate embeddings for an initial set of 5000 incidents, following the approach outlined in the course notebook L6-semantic-search, where the model API is called in batches (a rough sketch follows this list).
  2. My current focus is on building solutions for batch processing.
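The batched call looks roughly like the following sketch (the helper name, the batch size of 5 and the vertexai.init() project/location are placeholders, not exact values from my setup):

import vertexai
from vertexai.language_models import TextEmbeddingModel

# project and location are placeholders
vertexai.init(project="my-gcp-project", location="us-central1")
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

def embed_in_batches(texts, batch_size=5):
    """Call the embedding API in small batches and return one vector per text."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        responses = model.get_embeddings(batch)  # one TextEmbedding object per input text
        all_embeddings.extend([r.values for r in responses])
    return all_embeddings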

Existing Approach: Initially, I developed logic to download all existing incidents and save them as a CSV file (5000 records). Subsequently, I created embeddings using the aforementioned model and stored them in a pickle file.
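The save step is essentially this (a sketch; all_embeddings is the list of vectors returned by the batched calls above):

import pickle
import numpy as np

# all_embeddings: list of embedding vectors, one per closed incident
question_embeddings = np.array(all_embeddings)

with open('question_embeddings_app.pkl', 'wb') as file:
    pickle.dump(question_embeddings, file)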

In Week #1: I will receive a new set of 100 incidents as a CSV file. My plan is to load it into a dataframe and generate embeddings on-the-fly. Additionally, I’ll load the previously created embeddings for the initial 5000 incidents and the corresponding dataframe (incidents_df). The embeddings will be added as a new column, as illustrated below:

import pickle
import pandas as pd

# dataframe of the existing 5000 incidents (file name is a placeholder)
incidents_df = pd.read_csv('incidents.csv')

# embeddings of the existing incidents, saved earlier as a pickle file
with open('question_embeddings_app.pkl', 'rb') as file:
    question_embeddings = pickle.load(file)

# attach each embedding vector to its incident row
incidents_df['embeddings'] = question_embeddings.tolist()
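The on-the-fly part for the weekly file is similar (a sketch; the file and column names are placeholders, and embed_in_batches is the helper sketched earlier):

import pandas as pd

# weekly drop of ~100 new incidents
new_df = pd.read_csv('new_incidents_week1.csv')
new_embeddings = embed_in_batches(new_df['description'].tolist())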

Finally, leveraging ScaNN, I intend to identify the top 3 similar closed incidents for each incident in the new set of 100 (new_df).
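The search step is roughly the following sketch, along the lines of the course notebook (the tree/scoring parameters are assumptions I would still need to tune for ~5000 vectors):

import numpy as np
import scann

# normalize so that dot product behaves like cosine similarity
dataset = question_embeddings.astype(np.float32)
dataset = dataset / np.linalg.norm(dataset, axis=1, keepdims=True)

searcher = scann.scann_ops_pybind.builder(dataset, 3, "dot_product").tree(
    num_leaves=200, num_leaves_to_search=20, training_sample_size=dataset.shape[0]
).score_ah(2, anisotropic_quantization_threshold=0.2).reorder(30).build()

# new_embeddings: vectors of the ~100 new incidents from the sketch above
queries = np.array(new_embeddings, dtype=np.float32)
queries = queries / np.linalg.norm(queries, axis=1, keepdims=True)
neighbors, distances = searcher.search_batched(queries)  # neighbors[i] -> indices of the top 3 closed incidents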

Blocker/Point of Concern with the Current Approach: The challenge arises when I need to append the new embeddings to the existing incidents_df['embeddings'] and save the updated embeddings as a pickle file. As the number of embeddings grows over time, this process becomes cumbersome to manage.
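Concretely, the weekly update boils down to something like this sketch, and rewriting the whole pickle each time is what feels brittle as the data grows:

import pickle
import numpy as np
import pandas as pd

# grow the embedding matrix and the incident table in lockstep
question_embeddings = np.vstack([question_embeddings, np.array(new_embeddings)])
incidents_df = pd.concat([incidents_df, new_df], ignore_index=True)
incidents_df['embeddings'] = question_embeddings.tolist()

# the whole pickle file gets rewritten on every update
with open('question_embeddings_app.pkl', 'wb') as file:
    pickle.dump(question_embeddings, file)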

While exploring solutions, I found that vector databases seem promising. However, I'm having difficulty finding relevant documentation, especially from a GCP perspective.

I appreciate your insights and guidance on addressing this concern and any recommendations regarding Vector Databases within the GCP ecosystem.

Hi @pastorsoto, certainly storing embeddings in a database is a viable option, but I'm exploring scalable and easy-to-maintain solutions for the long term. Do you have any recommendations or insights in that regard?

Ok thanks! Very clear description!

Only one hint: GCP also works with the Databricks platform (2nd link I posted above).

The open-source Delta Lake technology might also be worth a look in this context, considering your current tech stack (see: What is Delta Lake? | Databricks on Google Cloud); it enables ACID transactions as well as time travel, which could be interesting for your use case.
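Just to illustrate (a rough sketch, assuming a Spark session with Delta Lake configured, e.g. on Databricks on GCP; the path and columns are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# pandas dataframe of the new incidents plus their embedding vectors (as plain Python lists)
new_sdf = spark.createDataFrame(new_df.assign(embedding=[list(v) for v in new_embeddings]))

# ACID append instead of rewriting one big pickle file
new_sdf.write.format("delta").mode("append").save("/mnt/incidents/embeddings_delta")

# Delta time travel: read an earlier version of the table if needed
week0 = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/incidents/embeddings_delta")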

Best regards
Christian