How to extract arbitrary data, store it in a vector database, and have an LLM answer questions based on that database

xhhuango · July 18, 2024, 7:15am

Hi everyone,

I’m new to AI. Recently, I have been studying how to use langchain to load an LLM, extract data, and save it into a vector database. I can then ask the LLM questions, and it answers based on the vector database. So far, everything seems to work.
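
To make that flow concrete, here is a minimal, library-free sketch of the extract / store / retrieve steps. It is not langchain code: the word-count "embedding", the `ToyVectorStore` class, and the two sample chunks are all hypothetical stand-ins for a real embedding model, a real vector database, and real extracted text. The final prompt is what would be passed to the LLM.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store: keeps each chunk next to its vector."""

    def __init__(self):
        self.items = []

    def add(self, text):
        self.items.append((text, embed(text)))

    def search(self, query, k=1):
        """Return the k stored chunks most similar to the query."""
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


# Invented sample chunks, standing in for text extracted from company files.
store = ToyVectorStore()
for chunk in [
    "Employees must badge in before 9 am.",
    "The annual company picnic is held in July.",
]:
    store.add(chunk)

question = "When is the company picnic?"
context = store.search(question, k=1)

# The retrieved chunk is placed in the prompt; the LLM only sees what was retrieved.
prompt = f"Answer using only this context:\n{context[0]}\n\nQuestion: {question}"
```

The point of the sketch is that the LLM never reads the database directly: it only sees whichever chunks the similarity search returns for a given question.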

However, I am now thinking about a bigger project. A company has a lot of data stored in different file types, including .docx, .pptx, .xlsx, and .pdf. This data can include, but is not limited to:

the company rules (.docx)
the company news (.pdf)
the company’s project presentation (.pptx)
the company’s employee salary/KPI records for the last 10 years (.xlsx)
the company’s revenue records and charts for the last 10 years (.xlsx)
and more

Can I just extract all the text from these files and save it into a vector database, and will the LLM then understand it?

In particular, tables in .xlsx files contain only short words, such as names, and numbers, such as salaries. Can an LLM understand this data as it is?

If not, what should I do?

And what if the company has many .xlsx files, each containing different numbers and information about different things?
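
To illustrate the spreadsheet concern, here is a small sketch comparing two ways a table row could be turned into text before it is stored. The column names, the rows, and the sheet name `salaries` are invented for illustration, and the labeled format is only one possible option, not an established recommendation.

```python
# Invented sample data, standing in for one sheet of an .xlsx file.
header = ["name", "year", "salary"]
rows = [
    ["Alice", 2023, 52000],
    ["Bob", 2023, 48000],
]


def naive_text(rows):
    """Join the raw cell values; the column meaning is lost."""
    return [" ".join(str(cell) for cell in row) for row in rows]


def labeled_text(header, rows, sheet):
    """Attach the sheet name and column headers to every row."""
    return [
        f"{sheet}: " + "; ".join(f"{h}={cell}" for h, cell in zip(header, row))
        for row in rows
    ]


print(naive_text(rows)[0])
print(labeled_text(header, rows, "salaries")[0])
```

With plain extraction, the first row becomes `Alice 2023 52000`, which carries no hint of what the numbers mean. With labels it becomes `salaries: name=Alice; year=2023; salary=52000`, so each stored row keeps its own context even when many different sheets end up in the same database.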

Thank you so much for reading and answering my questions.

Wayne
