How can I clean scraped txt data for fine tuning?

fkuyumcu · August 22, 2024, 9:36pm

Hello, I’ve extracted the text of a schoolbook PDF to a text file with PyMuPDF. But the schoolbook has tables, a table of contents, tests, etc., so I want to clean this data. Which method should I use to clean it when creating a fine-tuning dataset?

gent.spah · August 23, 2024, 7:09am

Maybe using an LLM might help you, either by providing guidance steps or by performing some of the process you require!
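
As a rough illustration of what a rule-based first pass could look like before handing the rest to an LLM, here is a minimal Python sketch. It is not from the thread: the function names (`clean_text`, `looks_like_table_row`) and all of the patterns are assumptions chosen for illustration, and the heuristics will need tuning for a real schoolbook (for example, the answer-option rule would also drop a sentence starting with "E. coli"). It works on the plain text file that was already extracted, so it uses only the standard library.

```python
import re

# All patterns below are illustrative assumptions, not a tested recipe.
TOC_LINE = re.compile(r"\.{3,}\s*\d+\s*$")                 # "Chapter 1 ........ 12"
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+\s*$", re.IGNORECASE)
ANSWER_OPTION = re.compile(r"^\s*[A-Ea-e][\).]\s+\S")      # "A) ..." test choices


def looks_like_table_row(line: str) -> bool:
    """Heuristic: three or more cells separated by tabs or wide gaps."""
    cells = [c for c in re.split(r"\t|\s{3,}", line.strip()) if c]
    return len(cells) >= 3


def clean_text(raw: str) -> str:
    """Drop lines that look like TOC entries, page numbers, tests, or tables."""
    kept = []
    for line in raw.splitlines():
        stripped = line.strip()
        if not stripped:
            kept.append("")  # keep paragraph breaks
            continue
        if (
            TOC_LINE.search(stripped)
            or PAGE_NUMBER.match(stripped)
            or ANSWER_OPTION.match(stripped)
            or looks_like_table_row(line)
        ):
            continue
        kept.append(stripped)
    text = "\n".join(kept)
    # Collapse runs of blank lines left behind by removed content.
    return re.sub(r"\n{3,}", "\n\n", text).strip()


if __name__ == "__main__":
    sample = (
        "Contents\n"
        "Chapter 1 Cells ........ 12\n"
        "\n"
        "Cells are the basic unit of life.\n"
        "12\n"
        "Name   Size   Function\n"
        "A) Nucleus\n"
        "B) Ribosome\n"
        "Plants use sunlight to make food."
    )
    print(clean_text(sample))
```

Whatever survives the filters can then be reviewed by hand or passed to an LLM for the cases that simple patterns miss.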
