What is the best option if we need a model for a downstream task in a different language from the base model? Fine-tuning the model for the downstream task, or pre-training the model on the target language first?
thanks
Or do we have another option?
If I had that challenge, I would probably start by fine-tuning a base model. It is a faster, lower-cost path.
If the results are not what I expect, then I would move on to a full re-training.
Re-training a base model can be a very big task, depending on the size of the model. So if you have access to the resources to do this (deep expertise, lots of data, and the compute power), that would be your ideal option.
In summary, I would definitely try fine-tuning first.
So your suggestion would be:
base model (fine-tuning on the target language) → fine-tuned model, then
fine-tuned model (fine-tuning for the downstream task) → final model
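For concreteness, here is a rough sketch of what that two-stage pipeline could look like with the Hugging Face Trainer API. The model name, file paths, and hyperparameters are illustrative placeholders, not recommendations, and a real run would need much more care with data preparation and compute.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "tiiuae/falcon-7b"  # placeholder: any causal base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for padding during collation
model = AutoModelForCausalLM.from_pretrained(base_model)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Stage 1: continued (language-adaptive) pre-training on raw target-language text.
lang_data = load_dataset("text", data_files={"train": "target_language_corpus.txt"})["train"]
lang_data = lang_data.map(tokenize, batched=True, remove_columns=["text"])
Trainer(
    model=model,
    args=TrainingArguments(output_dir="stage1-language", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=lang_data,
    data_collator=collator,
).train()

# Stage 2: fine-tune the language-adapted model on downstream-task examples,
# here formatted as plain text and trained with the same causal-LM objective.
task_data = load_dataset("text", data_files={"train": "downstream_task_examples.txt"})["train"]
task_data = task_data.map(tokenize, batched=True, remove_columns=["text"])
Trainer(
    model=model,
    args=TrainingArguments(output_dir="stage2-task", num_train_epochs=3,
                           per_device_train_batch_size=1),
    train_dataset=task_data,
    data_collator=collator,
).train()
```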
Yes, I would try this first of all. Thank you for the feedback and confirmation!
This is also very interesting to me. I’ve built a Q&A system with an open LLM (Flan-UL2) as the backbone and llama-index for information retrieval. It works perfectly with English documents, but for Polish (my native language) ones, it understands the content yet answers only in English.
I’m thinking about fine-tuning the base LLM on Polish-language corpora. The goal is to give the model the skill of answering in Polish to questions whose answers are in the ingested documents. So I don’t want to transfer new knowledge during fine-tuning, only linguistic skills…
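To make the setup concrete, here is a minimal sketch of the retrieval part only. Import paths vary between llama-index versions, the ./docs folder is just a placeholder, and the configuration of the open-LLM backbone is omitted.

```python
# Ingest documents and query them via llama-index's vector index.
from llama_index import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./docs").load_data()  # English or Polish files
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

# Retrieval finds the right Polish passages, but the backbone model still
# tends to answer in English, which is the behaviour described above.
response = query_engine.query("Jakie są główne wnioski z tego dokumentu?")
print(response)
```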
I know that LLM fine-tuning can be extremely expensive, so I wonder what the optimal size of the Polish-language corpus would be. Would an Alpaca-like set of 50K instructions be enough? Or should I aim for the Polish Wikipedia dump instead?
I would appreciate your advice, including some infrastructure recommendations and a cost approximation (just the order of magnitude), or any tips from your experience.
Many thanks in advance,
Andy
Hi @wodecki ,
First, I am not an expert in this task; I have never done it myself.
From my understanding of things, I would say this:
If my end goal is to answer questions on documents from a specific domain, I would focus on a language corpus used in that domain. This may reduce the size of the whole corpus compared with using the entire Polish corpus.
One challenge I see is that for translation you’ll typically need lots of English-to-Polish or Polish-to-English samples, and for this, as you say, maybe Wikipedia can help.
I would start with an initial, small fine-tuning run, just to test the waters.
As for the size of the base model, there are very good options already. Check the Hugging Face leaderboard of base LLMs. At the end of the day, I think this will depend very much on your resources: the bigger the base model, the more compute power and time you’ll need.
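If it helps, here is a minimal sketch of such a test-the-waters run using parameter-efficient fine-tuning (LoRA via the peft library), so only a small set of adapter weights is trained instead of the full model. The model name, target modules, file name, and hyperparameters are illustrative assumptions, not tested recommendations.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "tiiuae/falcon-7b-instruct"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Wrap the model with LoRA adapters: only a small fraction of weights train,
# which keeps memory use and cost far below full fine-tuning.
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["query_key_value"],  # Falcon attention projection
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# A small Polish text sample (hypothetical file) is enough for a first experiment.
data = load_dataset("text", data_files={"train": "polish_sample.txt"})["train"]
data = data.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])

Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-polish-test", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

Starting from an instruction-tuned checkpoint and training only adapters keeps the experiment cheap and easy to roll back if the Polish answers do not improve.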
Training in a completely new language seems like a big task. Please share your experience!
Hi @Juan_Olano,
Many thanks. I will first try Falcon-7b-instruct and will share my experience when it’s done.
Andy