Finetuning - length of data

Hi All,
I am starting to fine-tune an LLM for a domain-specific translation task. Could someone please advise which approach is better: using longer training examples (large sentences, say closer to the context length) or shorter ones (smaller sentences)? The former is easier for me, while the latter seems more intuitive.


I would guess it should be a mixture, so the model can learn from small sentences and longer contexts alike. That said, longer texts tend to be more beneficial: not only do they contain more information to learn from, but the model can also better learn longer-range dependencies between words.
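If it helps, here is a rough sketch of one way to build such a mixture: bucket the translation pairs by source length and sample a fixed fraction of long examples. The `source`/`target` field names, the 20-word cutoff, and the 50/50 split are all hypothetical choices, not anything standard.

```python
import random

def mix_lengths(examples, short_max_words=20, long_fraction=0.5, seed=0):
    """Split translation pairs into short/long buckets by source word count,
    then sample a balanced training set with the requested fraction of long
    examples. `examples` is a list of dicts with "source"/"target" keys."""
    short = [e for e in examples if len(e["source"].split()) <= short_max_words]
    long_ = [e for e in examples if len(e["source"].split()) > short_max_words]
    rng = random.Random(seed)
    # Cap the total so neither bucket is oversampled (no duplicates).
    n = 2 * min(len(short), len(long_))
    n_long = int(n * long_fraction)
    mix = rng.sample(long_, n_long) + rng.sample(short, n - n_long)
    rng.shuffle(mix)  # interleave short and long pairs for training
    return mix
```

You could then sweep `long_fraction` (e.g. 0.3 vs 0.7) and compare validation scores on long inputs to see how much the long examples actually matter for your domain.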

Thanks! When we attempted finetuning with smaller data points some time back, it didn't seem to scale well to long texts. That was with older OpenAI models, so we weren't sure what was playing out. So I wanted some advice before attempting the same again with current models.