Embedding of paragraphs, documents

I know embeddings are generated for words or tokens. How are these embeddings extended to the document, chunk, or paragraph level? Is it the sum or mean of all the word/token embeddings in that document/chunk/paragraph?

It's a good question. I also searched about it, and here is a summary of what happens from model to model:

For raw word embeddings (Word2Vec/GloVe) β†’ usually mean pooling (sometimes sum).
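A minimal sketch of mean pooling over pre-trained word vectors (the vectors and dimensionality here are toy values for illustration; in practice they would come from a Word2Vec or GloVe lookup table):

```python
import numpy as np

# Toy 4-dimensional word vectors standing in for Word2Vec/GloVe lookups.
word_vectors = {
    "the": np.array([0.1, 0.0, 0.2, 0.1]),
    "cat": np.array([0.9, 0.3, 0.1, 0.5]),
    "sat": np.array([0.2, 0.8, 0.4, 0.0]),
}

def mean_pool(tokens, vectors):
    """Average the vectors of all known tokens into one fixed-size vector."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    return np.mean(vecs, axis=0)

doc_embedding = mean_pool(["the", "cat", "sat"], word_vectors)
print(doc_embedding.shape)  # (4,) regardless of how many tokens the text has
```

Note that the output dimensionality is always that of a single word vector, no matter how long the text is, which is what makes the pooled vectors comparable.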

For transformers → either use the [CLS] token embedding or mean-pool over the token embeddings. After the sequence passes through the transformer, the final hidden state of [CLS] is treated as the sentence/document embedding (because self-attention lets it gather context from all tokens).
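The two options above can be sketched with NumPy on a simulated transformer output (random numbers stand in for a real model's final hidden states; the shapes and the attention mask are illustrative assumptions):

```python
import numpy as np

# Simulated final hidden states: one vector per token, with [CLS] at
# position 0 and two padding tokens at the end.
hidden = np.random.rand(6, 8)                   # 6 tokens x hidden size 8
attention_mask = np.array([1, 1, 1, 1, 0, 0])   # 1 = real token, 0 = padding

# Option 1: take the [CLS] token's final hidden state.
cls_embedding = hidden[0]

# Option 2: mean-pool over the real (non-padding) tokens only.
mask = attention_mask[:, None]                  # (6, 1) for broadcasting
mean_embedding = (hidden * mask).sum(axis=0) / mask.sum()

print(cls_embedding.shape, mean_embedding.shape)  # both (8,)
```

Masking out the padding positions matters for mean pooling: averaging over pad vectors would drag the embedding toward whatever values padding happens to produce.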

For modern embedding models (like OpenAI embeddings) → you don't need to combine anything; the model already returns a single vector per chunk/document.


Thanks @gent.spah

I also thought it was mean-based, but I have a doubt. My prompt/question embedding is also the mean of its word/token embeddings. The prompt has very few words/tokens, whereas a document or paragraph has many. If we generate a single representation like this, will approximate-nearest-neighbor search over the vector index still find good matches? Or does it fall back to old-style keyword search?