The examples in the short course were targeted questions that were more easily answered by chunking. Unless we use the refine approach, it seems to me the summary will be lacking (and even then). What do you see as the best way of handling summarization of many (many) documents? It seems the methods presented do not handle summarization very well. What do you use that works for summarization and chat? Thank you.
It is a bit of an art and a science. Here are a couple of tips:
You’ll have to play with the size of your chunks; sometimes smaller chunks help capture more detail.
Try adding some metadata (titles and other keywords, for instance), and include those keywords in your search queries.
Try this and share your results!
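To make the two tips above concrete, here is a minimal, library-free sketch of overlapping chunking with attached metadata. The chunk sizes, field names, and `with_metadata` helper are illustrative assumptions, not part of any particular framework's API:

```python
# Toy sketch: split text into overlapping character chunks, then attach
# title/keyword metadata so keyword matches can boost retrieval later.

def chunk_text(text, chunk_size=200, overlap=50):
    """Split `text` into chunks of `chunk_size` chars, overlapping by `overlap`."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - overlap
    return chunks

def with_metadata(chunks, title, keywords):
    """Attach the same title/keyword metadata to every chunk (hypothetical schema)."""
    return [{"text": c, "title": title, "keywords": keywords} for c in chunks]

docs = with_metadata(
    chunk_text("a" * 500, chunk_size=200, overlap=50),
    title="Employment Agreement",
    keywords=["wages", "overtime", "grievance"],
)
```

Shrinking `chunk_size` (and keeping some overlap so sentences aren't cut in half) is the knob to experiment with for capturing more detail.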
I probably will at some point. But here’s the rub: why not different indices, à la LlamaIndex’s Summary Index? It worked well for me. I was looking at my daughter’s nursing contract and used the summary index to first get an idea of the contents of the doc. The reply I got was: "This document is an employment agreement between EvergreenHealth and the Washington State Nurses Association that outlines the wages, hours of work, and conditions of employment for nurses employed by EvergreenHealth. It covers topics such as membership and dues, management rights, definitions, employment practices, seniority, layoff and recall, hours of work and overtime, compensation, holidays, vacations, sick leave, leaves of absence, employee benefits, committees, no strike-no lockout, grievance procedure, and general provisions." That is better than what I have gotten with my bundled attempts with the vector index. Then switching to the vector index gave great QA results. I must not understand this well enough, because it seems to me chunking could be a “black box” that just does the right thing 99.5% of the time. With that said, I will definitely want to explore here. I am very excited about the metadata and perhaps markdown indexing. Thank you.
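For anyone following along: conceptually, a summary index of this kind summarizes every chunk and then summarizes the summaries (a map-reduce pattern). Here is a toy sketch of that flow; the `summarize` function below is a stand-in for the LLM call (it just truncates), so this shows only the shape of the computation, not LlamaIndex's actual implementation:

```python
# Toy map-reduce summarization: summarize each chunk independently (map),
# then summarize the concatenated partial summaries (reduce).
# `summarize` is a stand-in for a real LLM call.

def summarize(text, max_words=10):
    """Stand-in for an LLM summary: keep the first few words."""
    return " ".join(text.split()[:max_words])

def map_reduce_summary(chunks, max_words=10):
    # Map step: one partial summary per chunk.
    partials = [summarize(c, max_words) for c in chunks]
    # Reduce step: summarize the combined partial summaries.
    return summarize(" ".join(partials), max_words)
```

Because every chunk is read, the final summary reflects the whole document (table of contents and all), which is why it answers "what is this doc about?" better than pulling a few similar chunks from a vector index.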
When you reference LlamaIndex, which of the two types of retrieval did you test: the LLM-based retrieval or the embedding-based retrieval?
From the content you shared, it looks very much like LLM-based retrieval. Please correct me if I’m wrong.
Well, at this point I don’t really care, given that it makes total sense just to summarize this way. However, re: "Retrieval can be performed through the LLM or embeddings (which is a TODO)." — see the docs…
As you point out, at this point it’s LLM-based… THANK YOU very much for all your patient and kind help. I have learned a lot. Thank you.
Hi @happyday , thank you for getting back.
Actually, it is an important distinction and decision. Summarizing by having the LLM process the entire document (or even chunks of it) is very different from summarizing only the chunks retrieved as similar via embeddings.
In my experience, the best results are obtained most of the time with a 100% LLM-based summary, but embeddings can be a close match at a much lower cost.
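To see why the cost gap is so large, here is a back-of-the-envelope comparison. The document size, chunk size, and top-k values are illustrative assumptions, not real measurements or pricing:

```python
# Illustrative prompt-token comparison: a full-document LLM summary must
# read every token, while embedding-based retrieval sends the model only
# the top-k most similar chunks. All numbers below are assumptions.

DOC_TOKENS = 50_000      # assumed total document length in tokens
CHUNK_TOKENS = 500       # assumed tokens per chunk
TOP_K = 5                # chunks retrieved via embedding similarity

full_llm_tokens = DOC_TOKENS                 # 100% LLM-based pass
embedding_tokens = TOP_K * CHUNK_TOKENS      # embedding-based pass

ratio = full_llm_tokens / embedding_tokens   # how many times fewer tokens
```

Under these assumptions the embedding-based approach sends the LLM 20x fewer prompt tokens, which is roughly where the "much lower cost" comes from (embedding calls themselves are typically far cheaper per token than generation).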
Hi @Juan_Olano ,
Thank you VERY much for your insights. Personally, I’m drinking from a firehose with all of this. These courses and your comments have been an extreme source of happiness as I figure out how all of this can benefit my family. I wanted to “tackle” summarization because a lot of what we might deal with (in our family) are contracts: anything from HR (“It’s in the contract…”) to a parent’s contract for assisted living. I wanted a way to simplify the contents and make them understandable. The first question, though, is: what is this document covering, and what general questions can I ask? I thought this was a good way to get comfortable, then turn to a chat model. Then I started getting concerned with the LLM cost to OpenAI (and down the road, all these TPU and CPU cycles need to be conserved, unless we have given up on this planet?). So perhaps the model will evolve into many indices spread across many smaller LLMs that are very capable of summarization, followed up with chat with the big brains: the “multiprocessing” approach. The summarization may not be as useful in a true chat QA à la ELIZA.
I am very glad that you are feeling that way. If there is anything I can assist with, just let me know!
Thank you @Juan_Olano for your kindness. With all this AI emphasis, I am so very glad DeepAI has such knowledgeable and helpful people such as yourself. Isn’t it delightful? That spark of happiness when we learn or meet something/someone absolutely fascinating — like these indices and LLMs…