Hi everyone,
I’m new to AI. Recently, I have been studying how to use LangChain to load an LLM, extract data, and save it into a vector database. I can then ask the LLM questions, and it answers based on the contents of the vector database. So far, so good.
However, I am now thinking about a bigger project. A company has a lot of data stored in different file types, including .docx, .pptx, .xlsx, and .pdf. This data can include, but is not limited to:
- the company rules (.docx)
- the company news (.pdf)
- the company’s project presentations (.pptx)
- the company’s employee salary/KPI records for the last 10 years (.xlsx)
- the company’s revenue records and charts for the last 10 years (.xlsx)
- and much more
Can I simply extract all the text from these files and save it into a vector database, and will the LLM be able to understand it?
In particular, tables in .xlsx files contain only short words, such as names, and numbers, such as salaries. Can the LLM understand this kind of data?
If not, what should I do?
And what if the company has many .xlsx files, each containing different numbers and information about different things?
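To make the .xlsx concern concrete, here is a rough sketch of one possible alternative to dumping raw cell values: turning each row into a sentence that carries its column headers, so a number like a salary keeps its context. This is only an illustration and has not been tested with a real LLM or vector database. The function name `rows_to_sentences`, the sheet name, and the sample data are all made up.

```python
def rows_to_sentences(sheet_name, header, rows):
    """Turn each table row into one self-describing sentence (hypothetical helper)."""
    sentences = []
    for row in rows:
        # Pair every cell with its column header so the value keeps its meaning.
        pairs = [f"{col}: {val}" for col, val in zip(header, row)]
        # Prefix the sheet name so rows from different files can be told apart.
        sentences.append(f"[{sheet_name}] " + "; ".join(pairs))
    return sentences


# Made-up sample data standing in for one sheet of an .xlsx file.
header = ["name", "year", "salary"]
rows = [["Alice", 2022, 50000], ["Bob", 2022, 48000]]

for sentence in rows_to_sentences("salaries", header, rows):
    print(sentence)
```

Each resulting sentence could then be stored as its own chunk. Whether text like this is enough for questions that need calculations across many rows is part of the question.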
Thank you so much for reading and answering my questions.
Wayne