I have my data in a collection of PDFs and other documents. Is there any way to extract data from them and generate an instruction dataset for instruction fine-tuning?
Could you please confirm what type of data you have in the PDFs and other documents? Is it only tabular data, or textual information as well?
It is a collection of textbooks, and they contain all types of data, including tables, images, text, etc.
We can use OCR tools to extract textual data from images, for example the Tesseract engine, which is available in Python through the pytesseract library.
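As a rough illustration of the OCR suggestion above, here is a minimal sketch, assuming pytesseract and Pillow are installed and the Tesseract binary is on your PATH. The helper names, the 200-word chunk size, the placeholder instruction, and the instruction/input/output record layout are all assumptions made for this example, not a required format. The "output" field is deliberately left empty, since the answers still have to be written by a person or generated by a model.

```python
import json
import re


def ocr_image(path, lang="eng"):
    """OCR a single page image with Tesseract (hypothetical helper)."""
    # Imported lazily so the text helpers below work without OCR installed.
    from PIL import Image
    import pytesseract
    return pytesseract.image_to_string(Image.open(path), lang=lang)


def clean_text(text):
    """Rejoin words hyphenated across line breaks and collapse whitespace."""
    text = re.sub(r"-\n(?=\w)", "", text)
    return re.sub(r"\s+", " ", text).strip()


def chunk_text(text, max_words=200):
    """Split text into chunks of at most max_words words (assumed size)."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def to_instruction_records(chunks,
                           instruction="Summarize the following passage."):
    """Wrap each chunk in an instruction-style record with an empty output."""
    return [{"instruction": instruction, "input": chunk, "output": ""}
            for chunk in chunks]


def write_jsonl(records, path):
    """Write one JSON record per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

A typical use would be something like `write_jsonl(to_instruction_records(chunk_text(clean_text(ocr_image("page_001.png")))), "dataset.jsonl")`, where `page_001.png` is a hypothetical scanned page. Note that plain OCR flattens tables into text, so tables in the textbooks would likely need separate handling.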
You may also look into the article below.
Please let us know if it resolves your query.