Building a chatbot using LangChain

Hello everyone! I’m new to the language model topic and would like to ask a question. I want to build a chatbot that answers questions from a PDF manual. Is that feasible using LangChain? And how will it train on this manual?

Yes, it is feasible with LangChain. You’ll need embeddings and a vector database as your main structural components. Then you’ll want a process that retrieves vectors from the database using some proximity measure like cosine similarity, and finally you’ll pass the retrieved chunks to an LLM to produce a nice output for the user.
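The retrieval step can be illustrated without any framework. Here is a minimal pure-Python sketch of cosine-similarity lookup over toy vectors; in a real system the vectors would come from an embedding model and live in a vector database, and the chunk names and 3-dimensional vectors below are made up for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, store, top_k=2):
    """Return the top_k chunk ids most similar to the query vector."""
    scored = sorted(store.items(),
                    key=lambda kv: cosine_similarity(query_vec, kv[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in scored[:top_k]]

# Toy "database": chunk id -> embedding vector (real ones have hundreds of dims).
store = {
    "installation": [0.9, 0.1, 0.0],
    "troubleshooting": [0.1, 0.9, 0.2],
    "warranty": [0.0, 0.2, 0.9],
}

print(retrieve([0.8, 0.2, 0.1], store, top_k=1))  # → ['installation']
```

The nearest chunks found this way are what you would then hand to the LLM, together with the user’s question, for the final answer.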

There are many examples on the internet. I can share one I did with ‘The Little Prince’:

Talking to a book

Hope this helps!

Hi,
I believe the new short course released by DeepLearning.ai called “LangChain Chat with Your Data” addresses exactly what you are interested in. Take a look at it.

Best regards
Fabio

Thanks a lot. I looked at your code and it does help, but it uses NLTK and I wanted LangChain.

Thank you so much. I will look at the course; I think it is exactly what I want to do.

Do you think that with LangChain a chatbot can learn rules? For example, I want to create a chatbot that decides whether a person is eligible for something. The rules are written in a PDF; is that feasible or not?
Thank you

I think you can create a chain of actions to have the bot do that, yes.

  1. Load your PDF into a vector database: split it into small chunks and embed each chunk as a vector.

  2. Get the question or comment from the user.

  3. Vectorize it and look for similar vectors in the database.

  4. Pass the user input and the retrieved chunks to the LLM and ask whether, according to that information, the user is eligible or not.

Note of caution: if this is a mission-critical process, I would not do it like that; I would have a human in the loop. The system’s answers will not always be perfect. Test it, use metrics to measure quality, and think about how the technology can assist your use case with the end goal of producing a solution that is safe for humans.

Hi @Juan_Olano, sorry to nudge you, but I went over your repo and it seems pretty close to what I want to accomplish, except I’m working with sensitive data. Do you have any tips for safeguarding the data (the PDF)? Are there any open-source tools or courses covering this that you know of? I know I can set up credentials, etc., but I don’t know whether OpenAI can access or use my data, so I’m looking for ways to keep it safe while using the API. Any tips would help.
Thanks

Hi @edsonzandamela ,

For this project you have 2 big components:

  1. The vector database where your knowledge will reside. This can be 100% local. Use a local vector database like ChromaDB (make sure to turn off the telemetry so no usage info is sent to their servers). Here your data never leaves your premises.

  2. The LLM that augments the data retrieved from the vector database. Here you have 2 options:
    a) Use an LLM via an API, like OpenAI’s. In this case your data will leave your premises. Per OpenAI’s statements, they do not use this data for training.
    b) Use a local LLM. The power of this solution depends very much on your resources. You can try a small LLM of, say, 7B parameters and see if the results are satisfactory. An LLM of this size doesn’t need many resources, but it isn’t a simple computer either: maybe one GPU with 24 GB of VRAM, and as much RAM and CPU as you can get. In this case your data stays 100% on premises, but being a small model, the augmentation will not be at GPT’s level. It may still work.

You may want to test with just the vector database: allow the user to enter any question and have your system return the most probable answers, a little like Google search. In this case I recommend using SentenceTransformers; I’ve had wonderful results with it.

Thank you for getting back to me @Juan_Olano

On the first component:

  1. Say I use ChromaDB or an internal DB to store the vector representation of my data. Wouldn’t there be potential for exposure when I use the API key to send prompts and retrieve the response from the model?

On the second component

  1. I was inclined towards option a) due to its time-saving benefits and avoidance of model training, etc. But as much as they say they won’t use the data, I’m still skeptical about using sensitive data, so I thought people were using other tools to safeguard the data while interacting with the GPT model.

Option b) seems to be the most secure, but it will require a lot of effort. If you know of great resources for exploring “SentenceTransformer,” I’d greatly appreciate your insights and recommendations.

@edsonzandamela ,

Thanks for the reply.

ChromaDB doesn’t need an API key, nor does this solution need prompts to retrieve answers. It works like a search engine: your docs, or rather your docs’ chunks, are stored locally in ChromaDB as vectors created with some embedding model, which can also run locally. Then you take the user’s question, convert it into a vector with the same embedding model, query ChromaDB (more or less like a SELECT in a relational DB), and show the results to the user. This option is 100% local; no data goes out at all.

I understand your hesitation; once data leaves your control, anything can happen. For option b), the bigger the augmentation model (the model that takes the results from your ChromaDB and turns them into a good, readable answer), the better. But again, I would try a small model first and see how it goes. I recommend testing with Llama-2-7B; I’ve seen very nice results out of this small model!

SentenceTransformers is just a library that lets you create an embedding matrix from your data; once the info is encoded, you can ask it questions and get very good responses in the form of your decoded data (no augmentation). I am working now on a project where we are taking the immigration information for Canada, which is complex and extensive, and I built a quick prototype with SentenceTransformers to answer questions about immigration. It works like a charm. There are many sample codes out there; check it out and let me know if you end up needing some additional help.

Cheers,

Juan

> ChromaDB doesn’t need an API key, nor does this solution need prompts to retrieve answers. […] This option is 100% local; no data goes out at all.

Thank you so much for clarifying this, @Juan_Olano. I had completely missed the point earlier and thought I had to go the other, much more difficult route. When you explained it again, I was able to go back, poke around with a few things, and I’m super happy to say I got something working. I really appreciate your detailed insights!

I am super happy that I was able to provide a solution to your query! I look forward to your success!
