Finetuning - length of data

Hi All,
I am starting to fine-tune an LLM for a domain-specific translation task. Could someone please advise which approach is better: using longer training examples (large sentences, say closer to the context length) or shorter ones (smaller sentences)? The former is easier for me, while the latter seems more intuitive.


I would guess it should be a mixture, so the model can learn from small sentences and longer contexts alike. That said, longer texts tend to be more beneficial: not only do they contain more information to learn from, but the model can also better learn longer-range dependencies between words.
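If it helps, here is a rough sketch of one way to build such a mixture: bucket the translation pairs by source length and sample a fixed fraction of long examples. The `source`/`target` field names, the 20-word cutoff, and the 50/50 split are all hypothetical choices, not anything standard.

```python
import random

def mix_lengths(examples, short_max_words=20, long_fraction=0.5, seed=0):
    """Split translation pairs into short/long buckets by source word count,
    then sample a balanced training set with the requested fraction of long
    examples. `examples` is a list of dicts with "source"/"target" keys."""
    short = [e for e in examples if len(e["source"].split()) <= short_max_words]
    long_ = [e for e in examples if len(e["source"].split()) > short_max_words]
    rng = random.Random(seed)
    # Cap the total so neither bucket is oversampled (no duplicates).
    n = 2 * min(len(short), len(long_))
    n_long = int(n * long_fraction)
    mix = rng.sample(long_, n_long) + rng.sample(short, n - n_long)
    rng.shuffle(mix)  # interleave short and long pairs for training
    return mix
```

You could then sweep `long_fraction` (e.g. 0.3 vs 0.7) and compare validation scores on long inputs to see how much the long examples actually matter for your domain.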

Thanks! When we attempted finetuning with smaller data points some time back, it didn't seem to scale well to long texts. That was with older OpenAI models, so we weren't sure what was playing out. So I wanted some advice before attempting the same again with current models.