What is the best option if we need a model for a downstream task in a different language from the base model? Fine-tuning the model for the downstream task, or pre-training the model on the target language first?
thanks
Or do we have another option?
If I had that challenge, I would probably start by fine-tuning a base model. It is a faster, lower-cost path.
If the results are not what I expect, then I would move on to a full re-training.
Re-training a base model can be a very big task, depending on the size of the model. So if you have access to the resources to do this (deep expertise, lots of data, and the compute power), that would be your ideal option.
In summary, I would definitely try fine-tuning first.
So your suggestion would be:
base model (fine-tuning on the target language) → fine-tuned model, then
fine-tuned model (fine-tuning for the downstream task) → final model
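For concreteness, here is a rough sketch of what that two-stage pipeline could look like with the Hugging Face Trainer API. The model name, file paths, and hyperparameters are illustrative placeholders, not recommendations, and a real run would need much more care with data preparation and compute.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "tiiuae/falcon-7b"  # placeholder: any causal base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for padding during collation
model = AutoModelForCausalLM.from_pretrained(base_model)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Stage 1: continued (language-adaptive) pre-training on raw target-language text.
lang_data = load_dataset("text", data_files={"train": "target_language_corpus.txt"})["train"]
lang_data = lang_data.map(tokenize, batched=True, remove_columns=["text"])
Trainer(
    model=model,
    args=TrainingArguments(output_dir="stage1-language", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=lang_data,
    data_collator=collator,
).train()

# Stage 2: fine-tune the language-adapted model on downstream-task examples,
# here formatted as plain text and trained with the same causal-LM objective.
task_data = load_dataset("text", data_files={"train": "downstream_task_examples.txt"})["train"]
task_data = task_data.map(tokenize, batched=True, remove_columns=["text"])
Trainer(
    model=model,
    args=TrainingArguments(output_dir="stage2-task", num_train_epochs=3,
                           per_device_train_batch_size=1),
    train_dataset=task_data,
    data_collator=collator,
).train()
```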
Yes, I would try this first of all. Thank you for the feedback and confirmation!
This is also very interesting to me. I’ve built a Q&A system with an open LLM (Flan-UL2) as the backbone and llama-index for information retrieval. It works perfectly with English documents, but for Polish (my native language) ones, it understands the content yet answers only in English.
I’m thinking about fine-tuning the base LLM on Polish-language corpora. The goal is to give the model the skill of answering in Polish to questions whose answers are in the ingested documents. So I don’t want to transfer new knowledge during fine-tuning, only linguistic skills…
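To make the setup concrete, here is a minimal sketch of the retrieval part only. Import paths vary between llama-index versions, the ./docs folder is just a placeholder, and the configuration of the open-LLM backbone is omitted.

```python
# Ingest documents and query them via llama-index's vector index.
from llama_index import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./docs").load_data()  # English or Polish files
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

# Retrieval finds the right Polish passages, but the backbone model still
# tends to answer in English, which is the behaviour described above.
response = query_engine.query("Jakie są główne wnioski z tego dokumentu?")
print(response)
```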
I know that LLM fine-tuning can be extremely expensive, so I wonder what the optimal size of the Polish-language corpus would be. Would an Alpaca-like set of 50K instructions be enough? Or should I aim for the Polish Wikipedia dump instead?
I would appreciate your advice, including some infrastructure recommendations and a cost approximation (just the order of magnitude), or any tips from your experience.
Many thanks in advance,
Andy
Hi @wodecki ,
First, I am not an expert in this task; I have never done it myself.
From my understanding of things, I would say this:
If my end goal is to answer questions on documents from a specific domain, I would focus on a language corpus used in that domain. This may reduce the size of the whole corpus compared with using the entire Polish corpus.
One challenge I see is that for translation you’ll typically need lots of English-to-Polish or Polish-to-English samples, and for this, as you say, maybe Wikipedia can help.
I would start with an initial, small fine-tuning run, just to test the waters.
As for the size of the base model, there are very good options already. Check the Hugging Face leaderboard of base LLMs. At the end of the day, I think this will depend very much on your resources: the bigger the base model, the more compute power and time you’ll need.
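If it helps, here is a minimal sketch of such a test-the-waters run using parameter-efficient fine-tuning (LoRA via the peft library), so only a small set of adapter weights is trained instead of the full model. The model name, target modules, file name, and hyperparameters are illustrative assumptions, not tested recommendations.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "tiiuae/falcon-7b-instruct"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Wrap the model with LoRA adapters: only a small fraction of weights train,
# which keeps memory use and cost far below full fine-tuning.
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["query_key_value"],  # Falcon attention projection
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# A small Polish text sample (hypothetical file) is enough for a first experiment.
data = load_dataset("text", data_files={"train": "polish_sample.txt"})["train"]
data = data.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])

Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-polish-test", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

Starting from an instruction-tuned checkpoint and training only adapters keeps the experiment cheap and easy to roll back if the Polish answers do not improve.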
Training in a completely new language seems like a big task. Please share your experience!
Hi @Juan_Olano,
Many thanks. I will first try Falcon-7b-instruct and will share my experience when it’s done.
Andy