Finetuning multilingual LLMs

redrocker08 · August 25, 2023, 8:12pm

Hello,
First of all, I would like to congratulate you on the courses. They are all excellent and very helpful for someone who is just starting out with LLMs. My question is the following:
Is there any LLM, either a base model or one finetuned to follow instructions (chat), that supports the Greek language and that I can finetune further to answer questions about mathematical definitions and methodologies? Do you have any suggestions?

Community-Team · August 28, 2023, 2:43pm

Hello, thank you for your message!

May I add your message to the finetuning category? I am sure our amazing mentors and fellow learners will gladly help.

redrocker08 · August 28, 2023, 3:47pm

Yes of course, feel free to do so.

SamReiswig · August 28, 2023, 6:46pm

Hi @redrocker08 !

Do you want the model to speak Greek?
If so, I wasn’t able to find a model that was trained specifically on Greek. Not even BLOOM had Greek in its training set.

However, if I’m reading your question correctly, it sounds like just having the Greek language symbols is good enough for Math definitions.

OpenAI just announced fine-tuning for GPT-3.5 Turbo (see the OpenAI Platform documentation).

PaLM 2 has support for Greek symbols.

Llama 2 also has support for Greek symbols and can be found on Hugging Face in various sizes.

You can also instruct these models in the prompt to output results in an intermediate format like Markdown or LaTeX, which can render Greek symbols.

Hope this helps!

Sam

redrocker08 · August 28, 2023, 11:56pm

Dear Sam, first of all thank you for your reply to my question.
Let me be more precise. My dataset consists of questions and answers (currently 1000 samples) written in Greek, all related to mathematical definitions and methodologies. Below are some translated examples. Note that I cannot use Google Translate from Greek to English, because my data relies on specific Greek terminology that it confuses: for example, παραπληρωματικές (supplementary) and συμπληρωματικές (complementary).

[Q] When are two angles called supplementary?
[A] Two angles are called supplementary when they sum to 180 degrees.
or
[Q] What do we call two angles which sum to 180 degrees?
[A] Two angles which sum to 180 degrees are called supplementary.

[Q] When is a triangle called isosceles?
[A] A triangle is called isosceles when it has two sides of equal length or two equal angles.

[Q] What do I call a triangle which has two sides of equal length?
[A] Isosceles.
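
For reference, pairs like these could be stored as one JSON object per line before finetuning. This is only a minimal sketch: the field names `question` and `answer` and the helper `to_jsonl` are hypothetical, and the format a given finetuning library expects may differ.

```python
import json


def to_jsonl(pairs):
    """Serialize (question, answer) pairs as JSON Lines, one record per line."""
    lines = []
    for question, answer in pairs:
        # Hypothetical field names; rename to match the finetuning library.
        record = {"question": question.strip(), "answer": answer.strip()}
        # ensure_ascii=False keeps Greek text readable instead of \uXXXX escapes.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Translated examples from this post; the real dataset is written in Greek.
pairs = [
    (
        "When are two angles called supplementary (παραπληρωματικές)?",
        "Two angles are called supplementary when they sum to 180 degrees.",
    ),
    (
        "When is a triangle called isosceles?",
        "A triangle is called isosceles when it has two sides of equal length.",
    ),
]

jsonl = to_jsonl(pairs)
print(jsonl)
```

Writing with `ensure_ascii=False` means the file has to be saved as UTF-8, but it stays easy to inspect by eye.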

So I am searching for an LLM (a multilingual one, perhaps?) that can speak Greek, or at least has some knowledge of the Greek language, and can generate Greek sentences. Unfortunately, from what I have seen, GPT-3.5 is not an option for me: it tokenizes Greek words at roughly the character level, so the cost would be very high during both fine-tuning and inference.
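
To make the cost concern concrete: modern Greek letters take two bytes each in UTF-8, so a byte-level tokenizer that has learned few Greek merges can need up to one token per byte. Here is a rough sketch of that upper bound using only the standard library. It does not call any real tokenizer, so actual token counts for GPT-3.5 may well differ, and the English sentence is just my own comparison text.

```python
def utf8_byte_count(text):
    """Number of UTF-8 bytes, a worst-case token count for a byte-level tokenizer."""
    return len(text.encode("utf-8"))


greek = "Ποιο τρίγωνο λέγεται αμβλυγώνιο?"
english = "Which triangle is called obtuse?"  # comparison text, not from my dataset

for label, text in (("Greek", greek), ("English", english)):
    chars = len(text)
    worst_case = utf8_byte_count(text)
    print(f"{label}: {chars} characters, at most {worst_case} byte-level tokens")
```

For plain ASCII text the byte count equals the character count, while for Greek it is close to double, and that ratio carries over to cost whenever the tokenizer falls back to bytes.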

From a quick search on Hugging Face I have found the following models:
lighteternal/gpt2-finetuned-greek
nikokons/gpt2-greek

Do you think I could fine-tune either of those on my dataset?

I have been trying to build this since 2019 (!) in order to help my students at school, so any help would be precious to me.

PS: When I use one of the above models to tokenize a Greek sentence
tokenizer = load_tokenizer(model2_path)
tokenizer.tokenize("Ποιο τρίγωνο λέγεται αμβλυγώνιο?")

I get

['Î',
'ł',
'Î¿',
'Î¹Î¿',
'ĠÏĦÏģÎ¯Î³ÏīÎ½Î¿',
'ĠÎ»ÎŃÎ³ÎµÏĦÎ±Î¹',
'ĠÎ±Î¼Î²',
'Î»Ïħ',
'Î³ÏİÎ½',
'Î¹Î¿',
'?']

Does this mean that the vocabulary of those models is not suited to my task?

SamReiswig · August 29, 2023, 5:58pm

Gotcha, I see: the output should be in Greek.

The only LLM I know of that explicitly cites Greek is PaLM 2, and I don’t think we can finetune it.
See Table 21 of https://ai.google/static/documents/palm2techreport.pdf (the percentage of Greek there is < 1%).

It’s possible that Llama 2 has Greek in it, but only in small amounts: < 0.005% of the training set. See Table 10 of the Llama 2 paper.

The question is whether these LLMs saw enough Greek examples to learn the structure of the Greek language. If their generated responses in Greek suggest they did, then finetuning can be useful.
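
One crude first check on generated samples is what share of their letters are actually Greek script. This is only a minimal sketch using the standard library: the function name `greek_letter_ratio` is my own, and a high ratio says nothing about grammar or correctness, so reading the outputs is still needed.

```python
import unicodedata


def greek_letter_ratio(text):
    """Fraction of alphabetic characters whose Unicode name marks them as Greek."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # Unicode names such as "GREEK SMALL LETTER ALPHA WITH TONOS" identify the script.
    greek = [ch for ch in letters if "GREEK" in unicodedata.name(ch, "")]
    return len(greek) / len(letters)


# Placeholder strings standing in for model outputs.
samples = [
    "Ποιο τρίγωνο λέγεται αμβλυγώνιο?",
    "A triangle with one obtuse angle.",
]
for sample in samples:
    print(f"{greek_letter_ratio(sample):.2f}  {sample}")
```

If a model asked in Greek keeps answering with a ratio near zero, that is a quick sign it is falling back to English or producing noise.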

Otherwise, I suggest collecting a large dataset in Greek and either re-training an existing model on it or training a new one.

PS:

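That looks like the tokenizer's byte-level representation rather than a broken vocabulary: GPT-2 style tokenizers work on UTF-8 bytes and display each byte as a stand-in character, so Greek looks scrambled when the tokens are printed. The examples on Hugging Face suggest that the tokenizer is working correctly.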
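
This can be checked without downloading anything. Below is a minimal sketch that rebuilds the byte-to-character table used by GPT-2 style byte-level tokenizers and maps the first tokens from your output back to Greek. The table follows the scheme in the GPT-2 reference encoder as far as I know, and the helper names `to_byte_level` and `from_byte_level` are my own.

```python
def bytes_to_unicode():
    """Map every byte value 0..255 to a printable stand-in character."""
    # Bytes that already print cleanly keep their own code point.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in keep}
    # Remaining bytes (space, control bytes, ...) get code points from 256 upward.
    offset = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + offset)
            offset += 1
    return table


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text):
    """Show text the way a byte-level tokenizer prints it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(tokens):
    """Join tokens, map stand-in characters back to bytes, decode as UTF-8."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in "".join(tokens))
    return raw.decode("utf-8")


first_tokens = ["Î", "ł", "Î¿", "Î¹Î¿"]  # first four tokens from the output above
print(from_byte_level(first_tokens))
print(to_byte_level(" τρίγωνο"))
```

The four tokens decode to Ποιο, the first word of your sentence. Note that tokens have to be joined before decoding, because a single token such as 'Î' can hold only half of a two-byte Greek letter. The many short pieces do suggest the vocabulary splits Greek fairly finely, but the text itself is intact.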

Hope this helps!

Sam
