The best method for adding documents to, or removing them from, existing embeddings

Hello there,
Thanks for the excellent courses. I have gained a lot of knowledge from them.

The concept of embedding is incredibly useful. After experimenting with it, I have a question about adding a document to, or deleting a document from, existing embeddings.

Suppose I have the following code to create the embeddings.

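# Import path assumes the langchain-openai package; older LangChain releases used `from langchain.embeddings import OpenAIEmbeddings`
from langchain_openai import OpenAIEmbeddings

docs = ["I like the dachshund.", "I love my dog.", "Today is cold outside."]
embeddings = OpenAIEmbeddings(chunk_size=1000).embed_documents(docs)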

The issue arises when I want to add a new document, ‘The weather is terrible today.’, to the embeddings. I am aware that I can embed the whole list again, as shown below. However, I am curious whether there is a more efficient method that operates only on the new document and not on the existing ones.

docs = ["I like the dachshund.", "I love my dog.", "Today is cold outside.", "The weather is terrible today."]
embeddings = OpenAIEmbeddings(chunk_size=1000).embed_documents(docs)

I have the same question about removing a single document from the existing embeddings: do we need to create all the embeddings again, or is there a more efficient method?

Thanks!

After conducting more tests, I have some thoughts on this. I found that the embedding of a given text is effectively the same (up to tiny numerical differences) regardless of how it’s embedded, whether as part of a documents list or as a separate text.

So, when adding new text to the documents, we can embed just that text and append the result to the existing embeddings, for instance with OpenAIEmbeddings(chunk_size=1000).embed_query(sentence).
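Here is a minimal sketch of that incremental approach. To keep it runnable without an API key, it uses a hypothetical stand-in, `fake_embed`, in place of the real embedding call; `add_document` is also just an illustrative helper name, not part of any library. With the real model you would pass something like `embedding.embed_query` as `embed_fn` instead.

```python
import hashlib

import numpy as np


def fake_embed(text, dim=8):
    """Hypothetical stand-in for a real embedding call.

    Returns a deterministic unit-length vector derived from the text,
    so the same text always maps to the same vector.
    """
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:4], "big")
    vec = np.random.default_rng(seed).standard_normal(dim)
    return (vec / np.linalg.norm(vec)).tolist()


def add_document(docs, embeddings, new_doc, embed_fn):
    """Embed only new_doc and append it; existing vectors are left untouched."""
    docs.append(new_doc)
    embeddings.append(embed_fn(new_doc))


docs = ["I like the dachshund.", "I love my dog.", "Today is cold outside."]
embeddings = [fake_embed(d) for d in docs]

# Keep a copy of the old vectors to confirm they do not change.
before = [list(v) for v in embeddings]

add_document(docs, embeddings, "The weather is terrible today.", fake_embed)
```

The key point is that `docs` and `embeddings` stay index-aligned, so only one embedding call is needed per new document.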

Below is some testing for reference.

sentence0 = "i like dogs"
sentence1 = "i like canines"
sentence2 = "Today is cold outside"
sentence3 = "The weather is terrible today"

# Embed the docs list
docs0 = [sentence0, sentence1, sentence2]
e_docs0 = OpenAIEmbeddings(chunk_size=1000).embed_documents(docs0)

# Add sentence3 to the docs list and embed the whole list again
docs1 = [sentence0, sentence1, sentence2, sentence3]
e_docs1 = OpenAIEmbeddings(chunk_size=1000).embed_documents(docs1)

# Embed only sentence3, as a one-item docs list
docs2 = [sentence3]
e_docs2 = OpenAIEmbeddings(chunk_size=1000).embed_documents(docs2)

# Get the embedding of each single sentence
embedding = OpenAIEmbeddings(chunk_size=1000)
e_sentence0 = embedding.embed_query(sentence0)
e_sentence1 = embedding.embed_query(sentence1)
e_sentence2 = embedding.embed_query(sentence2)
e_sentence3 = embedding.embed_query(sentence3)

import numpy as np

# The embedding of a single sentence matches the embedding of the same sentence within the docs embeddings (dot product ≈ 1)
print(np.dot(e_sentence0, e_docs1[0])) # 0.9999972593791817
print(np.dot(e_sentence1, e_docs1[1])) # 1.0000000339545314
print(np.dot(e_sentence2, e_docs1[2])) # 0.9999984148089067
print(np.dot(e_sentence3, e_docs1[3])) # 1.0000000209384206
print(np.dot(e_sentence3, e_docs2[0])) # 1.0000000209384206

# Each sentence's embedding is effectively identical across the different docs embeddings (dot product ≈ 1).
print(np.dot(e_docs0[0], e_docs1[0])) # 0.9999972835459958
print(np.dot(e_docs0[1], e_docs1[1])) # 1.0000000000000007
print(np.dot(e_docs0[2], e_docs1[2])) # 1.0000000000000004
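If each document's embedding really is independent of the others, as the numbers above suggest, then removing a document should not require re-embedding anything either: dropping the matching entry should be enough. A minimal sketch, assuming the embeddings are kept in a plain Python list aligned with the docs list (the helper name `remove_document` and the two-element vectors are made up for illustration):

```python
def remove_document(docs, embeddings, doc):
    """Delete doc and its embedding in place; no re-embedding is needed.

    Assumes docs[i] corresponds to embeddings[i].
    Raises ValueError if doc is not in docs.
    """
    idx = docs.index(doc)
    del docs[idx]
    del embeddings[idx]
    return idx


docs = [
    "I like the dachshund.",
    "I love my dog.",
    "Today is cold outside.",
    "The weather is terrible today.",
]
# Placeholder vectors standing in for real embeddings.
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

removed_index = remove_document(docs, embeddings, "I love my dog.")
```

If the embeddings live in a vector store rather than a list, the store's own delete operation would take the place of the `del` calls.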