[How to] Generate dataset from pdf/documents?

I have my data in a bundle of PDFs and other documents. Is there any way to extract the data from them and generate an instruction dataset for instruct fine-tuning?

Could you please confirm what type of data you have in the PDFs and other documents? Is it only tabular or textual information?

It is a collection of textbooks, and they contain all types of data, including tables, images, and text.

We can use OCR tools to extract text from images. For example, the Tesseract engine can be used from Python through the pytesseract library.
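
In case it helps, here is a minimal sketch of that OCR step, plus one possible way to stage the extracted text for an instruction dataset. It assumes pytesseract and Pillow are installed along with the Tesseract binary itself. The helper names (ocr_image, split_into_passages, to_records, write_jsonl) and the record layout with source, context, instruction and response fields are hypothetical choices for illustration, not a standard format. Tables and figures in the textbooks would likely need separate handling.

```python
import json


def ocr_image(image_path, lang="eng"):
    """Return the text Tesseract recognizes in one image file."""
    # Imported lazily so the other helpers work without OCR installed.
    # Assumes `pip install pytesseract pillow` plus the Tesseract binary.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)


def split_into_passages(text, max_chars=1000):
    """Group blank-line-separated paragraphs into passages of about max_chars."""
    passages, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        # Start a new passage once adding this paragraph would exceed the limit.
        if current and len(current) + len(para) + 2 > max_chars:
            passages.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        passages.append(current)
    return passages


def to_records(passages, source):
    """Wrap passages in a hypothetical record layout.

    The instruction and response fields are left empty on purpose: they
    still have to be written by a person or generated in a later step.
    """
    return [
        {"source": source, "context": p, "instruction": "", "response": ""}
        for p in passages
    ]


def write_jsonl(records, out_path):
    """Write one JSON object per line."""
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

A typical flow would be to call ocr_image on each scanned page, pass the text through split_into_passages and to_records, and save the result with write_jsonl. The passages only give you the context; the instruction and response pairs still need to be authored or generated and then reviewed.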

You may also look into the below article

Please let us know if it resolves your query.