Hello,
First of all, I would like to congratulate you on the courses; they are all excellent and very helpful for someone just starting out with LLMs. My question is the following:
Is there any LLM, either a base model or one fine-tuned to follow instructions (chat), that supports the Greek language and that I can fine-tune further to answer questions about mathematical definitions and methodologies? Do you have any suggestions?
Hello, thank you for your message!
May I add your message to the fine-tuning category? I am sure our amazing mentors and fellow learners will gladly help.
Yes of course, feel free to do so.
Hi @redrocker08 !
Do you want the model to speak Greek?
If so, I wasn’t able to find a model that was trained specifically on Greek. Not even BLOOM had Greek in its training set.
However, if I’m reading your question correctly, it sounds like just having the Greek language symbols is good enough for Math definitions.
OpenAI just announced fine-tuning for GPT-3.5 Turbo (see the OpenAI Platform documentation).
PALM2 has support for Greek Symbols.
LLaMA2 also has support for Greek Symbols and can be found on huggingface in various sizes.
You can also instruct these models in the prompt to output results in an intermediate format such as Markdown or LaTeX, which can render Greek symbols.
Hope this helps!
Sam
Dear Sam, first of all, thank you for your reply to my question.
Let me be more precise. My dataset consists of questions and answers (currently 1000 samples) written in Greek, covering mathematical definitions and methodologies. Below are some translated examples. Note that I cannot use Google Translate from Greek to English, because my data uses specific Greek terminology that it confuses (for example, παραπληρωματικές means supplementary and συμπληρωματικές means complementary).
[Q] When are two angles called supplementary?
[A] Two angles are called supplementary when they sum to 180 degrees.
or
[Q] What do we call two angles that sum to 180 degrees?
[A] Two angles that sum to 180 degrees are called supplementary.
[Q] When is a triangle called isosceles?
[A] A triangle is called isosceles when it has two sides of equal length or two equal angles.
[Q] What do I call a triangle that has two sides of equal length?
[A] Isosceles.
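For illustration, question/answer pairs like these could be stored one JSON object per line before fine-tuning. This is only a minimal sketch: the field names `question` and `answer` and the helper `pairs_to_jsonl` are assumptions made for the example, not a format required by any particular library. Passing `ensure_ascii=False` keeps Greek text readable in the file instead of turning it into `\u....` escapes.

```python
import json


def pairs_to_jsonl(pairs):
    """Serialize (question, answer) pairs as JSON Lines text.

    The field names are hypothetical; adapt them to whatever the
    chosen fine-tuning tool expects.
    """
    lines = []
    for question, answer in pairs:
        record = {"question": question.strip(), "answer": answer.strip()}
        # ensure_ascii=False writes non-ASCII characters (e.g. Greek) as-is.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"


# Translated examples from the post above.
examples = [
    ("When are two angles called supplementary?",
     "Two angles are called supplementary when they sum to 180 degrees."),
    ("What do I call a triangle that has two sides of equal length?",
     "Isosceles."),
]

print(pairs_to_jsonl(examples), end="")
```

Writing the returned string to a file opened with `encoding="utf-8"` would give one sample per line.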
So I am looking for an LLM (perhaps a multilingual one) that can speak Greek, or at least has some knowledge of the language, and can generate Greek sentences. Unfortunately, from what I have seen, GPT-3.5 is not an option for me: it tokenizes Greek words down to the character level, so the cost of fine-tuning and inference would be very high.
From a quick search on Hugging Face, I found the following models:
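To put a rough number on the tokenization concern: a byte-level tokenizer that has learned few Greek merges can, in the worst case, fall back to one token per UTF-8 byte, and Greek letters take two bytes each. The sketch below only counts characters and bytes. It gives an upper bound under that assumption, not the actual GPT-3.5 token count, and the English sentence is an illustrative translation added for comparison.

```python
def utf8_footprint(text: str) -> dict:
    """Return character count, UTF-8 byte count, and their ratio."""
    if not text:
        raise ValueError("text must not be empty")
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    return {
        "chars": n_chars,
        "bytes": n_bytes,
        "bytes_per_char": n_bytes / n_chars,
    }


greek = "Ποιο τρίγωνο λέγεται αμβλυγώνιο?"
english = "Which triangle is called obtuse?"  # illustrative translation

for label, text in (("greek", greek), ("english", english)):
    print(label, utf8_footprint(text))
```

The Greek sentence needs close to twice as many bytes per character as the English one, which is where the extra cost comes from when a tokenizer has no Greek-specific vocabulary.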
lighteternal/gpt2-finetuned-greek
nikokons/gpt2-greek
Do you think I could fine-tune either of those on my dataset?
I have been trying to build this since 2019, in order to help my students at school, so any help would be precious to me.
PS: when I use one of the above models to tokenize a Greek sentence,
tokenizer = load_tokenizer(model2_path)
tokenizer.tokenize("Ποιο τρίγωνο λέγεται αμβλυγώνιο?")
I get
['Î',
'ł',
'ο',
'ιο',
'ĠÏĦÏģίγÏīνο',
'ĠλÎŃγεÏĦαι',
'Ġαμβ',
'λÏħ',
'γÏİν',
'ιο',
'?']
Does this mean that the vocabulary of those models is not suited to my task?
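One rough way to look at this question is to count tokens per word for the output above. The sketch below does that with plain Python; the helper `tokens_per_word` is a hypothetical name, and splitting words on whitespace is a simplifying assumption.

```python
def tokens_per_word(sentence: str, tokens: list) -> float:
    """Average number of tokens per whitespace-separated word."""
    words = sentence.split()
    if not words:
        raise ValueError("sentence contains no words")
    return len(tokens) / len(words)


sentence = "Ποιο τρίγωνο λέγεται αμβλυγώνιο?"
# Token list copied from the tokenizer output shown above.
tokens = ["Î", "ł", "ο", "ιο", "ĠÏĦÏģίγÏīνο", "ĠλÎŃγεÏĦαι",
          "Ġαμβ", "λÏħ", "γÏİν", "ιο", "?"]

print(tokens_per_word(sentence, tokens))  # 11 tokens / 4 words = 2.75
```

Two of the four words come out as a single token each, which suggests the vocabulary does contain whole Greek words; a value close to the number of characters per word would point the other way.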
Gotcha, I see: the output should be in Greek.
The only LLM I know of that explicitly cites Greek is PaLM 2, and I don’t think we can fine-tune it.
See Table 21 of https://ai.google/static/documents/palm2techreport.pdf
where the percentage of Greek is < 1%.
It’s possible that Llama 2 has Greek in it, but only in small amounts (< 0.005% of the training set). See Table 10 of the Llama 2 paper.
The question is whether these LLMs have seen enough Greek examples to learn the structure of the language. If their generated responses in Greek suggest they have, then fine-tuning can be useful.
Otherwise, I suggest collecting a large dataset in Greek and retraining an existing model or training a new one.
PS:
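That looks like a display issue rather than a vocabulary problem: the Unicode (UTF-8) bytes of the Greek letters are being shown as single-byte characters. The examples on Hugging Face suggest that the tokenizer is working correctly.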
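To make that concrete, here is a small sketch of the byte-to-character table used by GPT-2 style byte-level BPE tokenizers, rebuilt with the standard library only. It assumes the two models above use that scheme, and it assumes that the plain Greek letters mixed into the posted tokens should be taken literally; the helper names are made up for the example.

```python
def bytes_to_unicode() -> dict:
    """Map each byte 0..255 to a printable character, GPT-2 style."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in printable}
    extra = 0
    for b in range(256):
        if b not in table:
            # Non-printable bytes are shifted to code points 256 and up.
            table[b] = chr(256 + extra)
            extra += 1
    return table


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text: str) -> str:
    """Show text the way a byte-level tokenizer displays it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(token: str) -> str:
    """Turn a displayed token back into readable text.

    Characters outside the table (assumption: already-decoded letters)
    are kept as their own UTF-8 bytes.
    """
    raw = bytearray()
    for ch in token:
        if ch in CHAR_TO_BYTE:
            raw.append(CHAR_TO_BYTE[ch])
        else:
            raw.extend(ch.encode("utf-8"))
    return raw.decode("utf-8")


print(to_byte_level("Π"))     # Îł
print(from_byte_level("Îł"))  # Π
print(repr(from_byte_level("ĠÏĦÏģίγÏīνο")))
```

Under these assumptions the first two tokens in the output together encode the single letter Π, and `Ġ` marks a leading space, so the odd-looking characters are a rendering of bytes and not missing vocabulary.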
Hope this helps!
Sam