Embedding of paragraphs, documents

I know embeddings are generated for words or tokens. How are these embeddings extended to the document, chunk, or paragraph level? Is it the sum or mean of all the word/token embeddings in that document/chunk/paragraph?

It's a good question. I also searched about it, and here is a summary of what happens from model to model:

For raw word embeddings (Word2Vec/GloVe) β†’ usually mean pooling (sometimes sum).
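A minimal sketch of mean pooling over pre-trained word vectors (the vectors and dimensionality here are toy values for illustration; in practice they would come from a Word2Vec or GloVe lookup table):

```python
import numpy as np

# Toy 4-dimensional word vectors standing in for Word2Vec/GloVe lookups.
word_vectors = {
    "the": np.array([0.1, 0.0, 0.2, 0.1]),
    "cat": np.array([0.9, 0.3, 0.1, 0.5]),
    "sat": np.array([0.2, 0.8, 0.4, 0.0]),
}

def mean_pool(tokens, vectors):
    """Average the vectors of all known tokens into one fixed-size vector."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    return np.mean(vecs, axis=0)

doc_embedding = mean_pool(["the", "cat", "sat"], word_vectors)
print(doc_embedding.shape)  # (4,) regardless of how many tokens the text has
```

Note that the output dimensionality is always that of a single word vector, no matter how long the text is, which is what makes the pooled vectors comparable.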

For transformers → either use the [CLS] token embedding or mean-pool over the token embeddings. After the sequence passes through the transformer, the final hidden state of [CLS] is treated as the sentence/document embedding (because self-attention lets it gather context from all tokens).
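The two options above can be sketched with NumPy on a simulated transformer output (random numbers stand in for a real model's final hidden states; the shapes and the attention mask are illustrative assumptions):

```python
import numpy as np

# Simulated final hidden states: one vector per token, with [CLS] at
# position 0 and two padding tokens at the end.
hidden = np.random.rand(6, 8)                   # 6 tokens x hidden size 8
attention_mask = np.array([1, 1, 1, 1, 0, 0])   # 1 = real token, 0 = padding

# Option 1: take the [CLS] token's final hidden state.
cls_embedding = hidden[0]

# Option 2: mean-pool over the real (non-padding) tokens only.
mask = attention_mask[:, None]                  # (6, 1) for broadcasting
mean_embedding = (hidden * mask).sum(axis=0) / mask.sum()

print(cls_embedding.shape, mean_embedding.shape)  # both (8,)
```

Masking out the padding positions matters for mean pooling: averaging over pad vectors would drag the embedding toward whatever values padding happens to produce.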

For modern embedding models (like OpenAI embeddings) → you don't need to combine anything; the model already returns a single vector per chunk/document.


Thanks @gent.spah

I also thought it was mean-based, but I have a doubt. My prompt/question embedding is also the mean of its word/token embeddings. The prompt has very few words/tokens, whereas a document or paragraph has many. If we generate a single representation like this, will approximate-nearest-neighbor search over the vector index still find good matches? Or does it fall back to old-style keyword search?