[How to] Generate dataset from pdf/documents?

mahimairaja · September 4, 2023, 1:15pm

I have my data in a bundle of PDFs and other documents. Is there any way to extract the data from them and generate an instruction dataset for instruction fine-tuning?

raj · September 4, 2023, 3:10pm

Could you please confirm what type of data you have in the PDFs and other documents? Is it only tabular or textual information?

mahimairaja · September 4, 2023, 3:32pm

It is a collection of textbooks, and they contain all types of data, including tables, images, and text.

raj · September 4, 2023, 3:50pm

We can use OCR tools to extract textual data from images. For example, the Tesseract OCR engine can be used from Python through the pytesseract library.
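As a rough illustration, here is a minimal sketch of how OCR output could be turned into the skeleton of an instruction dataset. It assumes pytesseract, Pillow, and the Tesseract binary are installed. The helper names (`ocr_image`, `chunk_text`, `to_records`, `to_jsonl`), the 200-word chunk size, and the record fields are hypothetical choices for illustration, not an established format. The `instruction` and `response` fields are left empty, because writing them (by hand or with a model) is a separate step that this sketch does not cover.

```python
import json


def ocr_image(image_path):
    """Run Tesseract OCR on one image file and return the recognized text."""
    # Imported lazily so the text helpers below work without OCR installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path))


def chunk_text(text, max_words=200):
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def to_records(chunks, source):
    """Wrap each chunk in a record with empty instruction/response fields."""
    # Field names are an assumption; adapt them to your fine-tuning format.
    return [
        {"source": source, "context": chunk, "instruction": "", "response": ""}
        for chunk in chunks
    ]


def to_jsonl(records):
    """Serialize records as JSON Lines, one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


if __name__ == "__main__":
    # "page_001.png" is a placeholder path for a scanned textbook page.
    text = ocr_image("page_001.png")
    records = to_records(chunk_text(text), source="page_001.png")
    with open("dataset.jsonl", "w", encoding="utf-8") as f:
        f.write(to_jsonl(records) + "\n")
```

Keeping OCR separate from chunking means the same `chunk_text` and `to_records` steps can be reused for text that comes from a PDF text extractor instead of OCR. Tables and figures in the textbooks will likely need separate handling.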

You may also look into the below article

Please let us know if it resolves your query.
