Langchain - How to get the "Source Documents" information as output?

Hi all,

I am currently using Langchain and the OpenAI API to build a custom chatbot on top of a dataset that consists of different attributes/features (both qualitative and quantitative) of users, such as age, education, height, and self-introduction.

Essentially, I would like the chatbot to return the name/profile of one or more users based on the query entered in a prompt (e.g. “I am looking for a university graduate who loves football and hanging out with friends”). The chatbot should also avoid hallucinating answers to questions that are irrelevant to this database.

I am using Pinecone to store the vector embeddings and have been following the Langchain doc here: Pinecone | 🦜️🔗 Langchain
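For context, my setup roughly follows that page. Here is a minimal sketch (using the pre-1.0 Langchain API from that doc; the index name and the `docs` variable are mine):

```python
import pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

pinecone.init(api_key="...", environment="...")  # credentials elided

# docs: a list of Langchain Document objects, one per user profile
embeddings = OpenAIEmbeddings()
docsearch = Pinecone.from_documents(docs, embeddings, index_name="user-profiles")

# sanity check: top matches for a sample query
matches = docsearch.similarity_search("university graduate who loves football", k=4)
```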

In the response/output, is there any way I could retrieve the data in the “Source Documents”? Langchain only shows me its thought process and how the final output is produced by evaluating the most similar “Source Documents” (in this case, the user profiles), but there is no way I can get those Source Documents as part of the output/data that I can manipulate. Essentially, I want to get some of the data used in its “thinking process” for further processing.

Is there any way to achieve this? Or are there any other Langchain agents/tools that can help do this?

Many thanks!

Hi @jacobarsenal, if I am understanding correctly, this would be my initial approach to it.

From what I understand, you have source documents from which you extract the profile information of some people. Then you use this profile information to create an embeddings database that you will later query against some criteria.

If you want to retrieve a user profile and also be able to access the source document of that profile, what you may want to do is store, alongside each embedding, the id of the document, or even its path. That way, when you search the vector database, the vectors you retrieve carry this information with them. From there, you just take the id of the document and look it up in a documents table to get its path; or, if you embedded the actual path of each document, you already have the path and can use it to access the document and retrieve more information, or provide a link to it.
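For instance, a sketch assuming Langchain's Document wrapper (the field names are just illustrative):

```python
from langchain.docstore.document import Document

# profiles: your mapping of document id -> profile text
docs = [
    Document(
        page_content=profile_text,
        metadata={"doc_id": doc_id, "path": f"/profiles/{doc_id}.txt"},
    )
    for doc_id, profile_text in profiles.items()
]
# the metadata travels with each vector, so every match you pull back
# from the vector database carries the id/path of its source document
```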

Thoughts?

Hi Juan,

Thanks a lot for your reply! Maybe let me explain a little bit about what we are trying to do.

Please refer to below as an example.

User’s enquiry: Hey, can you give me one user in the database who is at least X cm tall, weighs Y kg, and also has feature Z?
AI Assistant output: Sure, User A would match your criteria, and here is the full profile of User A.

In this case, we already know which user profiles the chain uses in its thinking process by reading the terminal after setting verbose=True. But we cannot retrieve those user profiles from its “thinking process”/intermediate steps, as they are not available in the output.

If I set the chain’s k value (which denotes the number of documents the retriever fetches) to 10, it will not only return the completion output but also give us another set of data, called source documents. This tells us which 10 user profiles it used as references to produce the output, comparing and ranking those 10 profiles to give us the one that best matches our input criteria (in this case, User A). In essence, the agent/chain utilises all 10 user profiles in its “thinking process” to derive a final completion/output for us.

However, the completion output may not directly mention these 10 sources/user profiles, or it might not mention their existence at all; we cannot control this aspect of the completion output. But we would like to obtain that data (the user profiles, in this case) from its “thinking process”. Do you think there is any agent/tool within Langchain that enables this capability?
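Concretely, what we are after is something along these lines (just a sketch; I am assuming a RetrievalQA-style chain here, and docsearch is the Pinecone store from my first post):

```python
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# return_source_documents=True asks the chain to hand back the
# retrieved documents alongside the completion
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    chain_type="stuff",
    retriever=docsearch.as_retriever(search_kwargs={"k": 10}),
    return_source_documents=True,
)

result = qa({"query": "at least X cm tall, weighs Y kg, has feature Z"})
print(result["result"])                 # the final completion
for doc in result["source_documents"]:  # the 10 profiles used as references
    print(doc.page_content, doc.metadata)
```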

Sorry for the confusion and happy to share more about the code base, if it is helpful.

Thanks again for your help!

I see what you mean.

Well, that’s probably one of the ‘issues’ of using a framework like langchain: you are subject to the idiosyncrasies of the framework. In this case, the framework decides not to expose its candidates the way you would want to see them.

Have you considered implementing this solution using a different approach?

If I understand it correctly, your task is rather simple to implement:

  1. Create your dataset
  2. Pre-process your data
  3. Create an embeddings database. There are many Python libraries for this, and you can use Chroma to store it (see the sketch below).
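For example, a minimal sketch with a recent chromadb client (the collection name and fields are illustrative):

```python
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./profiles_db")  # persists to disk
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="sk-...", model_name="text-embedding-ada-002"
)
collection = client.create_collection("user_profiles", embedding_function=openai_ef)

# one entry per user profile; metadatas keep the structured attributes
collection.add(
    ids=["user_a", "user_b"],
    documents=["Profile text of user A...", "Profile text of user B..."],
    metadatas=[{"height_cm": 180, "weight_kg": 75}, {"height_cm": 165, "weight_kg": 60}],
)
```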

From here:

  1. Get a user’s query.
  2. Embed it using the same method used above.
  3. Find proximity using, for example, cosine similarity. There are many libs for this as well.

This will return the most similar entries, which you can sort in order. You’ll have all the info of each candidate.
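If you go with Chroma as sketched above, it computes the proximity for you and hands back everything it stored (you could equally compute cosine similarity yourself with numpy). A hypothetical query:

```python
# embed the query with the same embedding function and rank by proximity
results = collection.query(
    query_texts=["at least 180 cm tall, around 75 kg, loves football"],
    n_results=10,
)

# everything you stored comes back, already sorted by distance
for doc, meta, dist in zip(
    results["documents"][0], results["metadatas"][0], results["distances"][0]
):
    print(dist, meta, doc)
```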

This is more or less what langchain does behind the scenes.

Do you think this is useful? If I am still lost, I apologize 🙂