Hello, I’ve extracted the text of a schoolbook PDF into a text file with PyMuPDF. But the schoolbook also contains tables, a table of contents, tests, etc., so I want to clean this data. Which method should I use to clean it before creating a fine-tuning dataset?
Maybe using an LLM might help you, either by suggesting the cleaning steps or by performing some of the process for you!
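If you want a cheap first pass before (or alongside) handing the text to an LLM, a rule-based line filter can drop the most obvious noise. The sketch below is only an illustration: the regex patterns, the 0.5 letter-ratio threshold, and the helper names `is_noise` and `clean_text` are all assumptions of mine, not something from PyMuPDF, and you would need to tune them on your own book.

```python
import re

# Hypothetical heuristics; patterns and thresholds are guesses to tune per book.
TOC_LINE = re.compile(r"\.{3,}\s*\d+\s*$")       # e.g. "Chapter 1 ........ 12"
PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")      # a line holding only a page number
TABLE_LIKE = re.compile(r"(\S+\s{3,}){2,}\S+")    # 3+ cells separated by wide gaps


def is_noise(line: str) -> bool:
    """Return True if the line looks like TOC, page-number, or table residue."""
    stripped = line.strip()
    if not stripped:
        return False  # keep blank lines so paragraph breaks survive
    if TOC_LINE.search(stripped) or PAGE_NUMBER.match(stripped):
        return True
    if TABLE_LIKE.search(stripped):
        return True
    # Lines that are mostly digits or punctuation are rarely running prose.
    letters = sum(ch.isalpha() for ch in stripped)
    return letters / len(stripped) < 0.5


def clean_text(raw: str) -> str:
    """Drop lines flagged as noise and rejoin the rest."""
    kept = [ln for ln in raw.splitlines() if not is_noise(ln)]
    return "\n".join(kept)
```

You could run `clean_text` over the extracted file, inspect what was dropped, and then pass only the ambiguous leftovers (test sections, captions) to an LLM for a final decision.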