I have a question on the availability of LLMs in a specific language.
For some time, a friend and I have considered creating AI tools for medical doctors and practitioners.
For context, I’m in Argentina (we speak Spanish here).
Recently I stumbled upon AWS HealthScribe.
I also saw the announcement of Meditron-70B and the GitHub link here.
As you can see, both the AWS service and the open-source model work in English.
After searching, I found that the medical LLM space in Spanish (and possibly other languages) is terribly underserved. Is that so?
Fine-tuning an open-source LLM to produce medical text in Spanish seems daunting. I wouldn’t know where to start.
I am not sure about dataset availability or where to get the resources to perform the training. This is way out of my league.
Could you help me validate my assumption here? Is it the case that there aren’t any medical LLMs in Spanish?
How would you tackle such a problem?
Thanks in advance!
This concept seems problematic, because there are two stages where inaccuracy can occur.
I believe (not my field though) that most medical journals are in English.
So first you will have to use English-to-Spanish translation, which introduces some probability of errors or loss of information.
Then you will need some AI to generate whatever Spanish language reports you need, likely via one of the LLM tools. As we know, chat tools will insert their own very plausible sounding errors, which may be difficult to detect.
@TMosh, thanks for the reply.
I agree; if we translate the original input to English and the model’s output back to Spanish, we will get many things “lost in translation.”
The fact that most scientific research journals are in English makes it logical that there aren’t medical LLMs in Spanish. It also shows that AI simply propagates a cultural bias that already exists outside AI.
So many things to ponder right now.
As a Spaniard, I can assure you that this problem is really common in many areas, such as finance or NL2SQL.
There are a lot of good LLMs able to generate good SQL, but you need to prompt the model in English.
If the problem is only in the input text, I think the best solution is to translate it. That seems better than trying to train a model on Spanish texts only, or translating the English training texts to Spanish beforehand.
Thanks for chiming in.
You’re right on.
The problem with translations in our use case (which, for context, is working with clinical notes taken by doctors) is that doctors use specific terminology and lingo that is not used by the general public. So certain expressions can have a particular meaning that a simple translation engine might not easily translate.
In any case, we are validating that assumption first.
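A minimal sketch of how that validation might look: send a note through a Spanish to English to Spanish round trip and list the clinical terms that do not come back. The translator callables and the toy word substitutions below are placeholders made up for illustration, not a real translation API; a real test would plug in whatever engine is being evaluated.

```python
from typing import Callable, Iterable, List

Translator = Callable[[str], str]


def lost_terms(note: str, terms: Iterable[str],
               es_to_en: Translator, en_to_es: Translator) -> List[str]:
    """Return the terms present in `note` that vanish after a round trip."""
    original = note.lower()
    round_trip = en_to_es(es_to_en(note)).lower()
    return [t for t in terms
            if t.lower() in original and t.lower() not in round_trip]


# Toy stand-ins for a real engine (assumption: they only mimic one failure
# mode, where a clinical term is paraphrased into lay language).
def toy_es_to_en(text: str) -> str:
    return text.replace("disnea", "shortness of breath")


def toy_en_to_es(text: str) -> str:
    return text.replace("shortness of breath", "falta de aire")


note = "Paciente con disnea y HTA"
print(lost_terms(note, ["disnea", "HTA"], toy_es_to_en, toy_en_to_es))
# ['disnea']
```

The check uses plain substring matching, so it is only a first filter; abbreviations and inflected forms would need a proper clinical glossary.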
Looks like the main problem is the lack of Spanish medical content. So, it would be best if you started there. How can you acquire a good amount of this kind of content? Medical magazines, journals, etc? Maybe also consider just content in Spanish but not necessarily medical.
Also, maybe you could consider enrolling in the course Finetuning Large Language Models from DeepLearning.AI.
I added some papers that could be of interest:
The Chinchilla paper shows how much data is needed to train an LLM of a given size compute-optimally.
[2203.15556] Training Compute-Optimal Large Language Models
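As a rough illustration of the arithmetic in that paper: a commonly quoted reading of its results is about 20 training tokens per model parameter, with training compute approximated as 6 × parameters × tokens. The sketch below assumes those two rules of thumb. They describe pretraining from scratch; fine-tuning an existing model needs far less data.

```python
TOKENS_PER_PARAM = 20  # rule of thumb, an approximation rather than an exact law


def compute_optimal_tokens(n_params: int) -> int:
    """Approximate number of training tokens for a compute-optimal model."""
    return TOKENS_PER_PARAM * n_params


def approx_training_flops(n_params: int, n_tokens: int) -> int:
    """Approximate training compute using the 6 * N * D estimate."""
    return 6 * n_params * n_tokens


for n_params in (7_000_000_000, 70_000_000_000):
    tokens = compute_optimal_tokens(n_params)
    flops = approx_training_flops(n_params, tokens)
    print(f"{n_params / 1e9:.0f}B params -> ~{tokens / 1e9:.0f}B tokens, "
          f"~{flops:.2e} FLOPs")
```

Under these assumptions a 7B-parameter model would want on the order of 140B tokens, which gives a sense of scale when sizing a Spanish medical corpus.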
This Flan paper also has some interesting orientation about instruction finetuning.
[2301.13688] The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
Hey @leonardo.pabon , thanks a lot for your reply.
Those are very good tips. Thank you!
Could I create synthetic data like the EMRBots clinical dataset?
I will check the papers and the course.
To use synthetic data, you would need an LLM that generates good content in Spanish. Maybe some existing model is good enough at that. Otherwise, you would need to finetune a model on that. I read about some people finetuning Llama in Portuguese with supposedly improved results.
Another interesting source is the Alpaca project. They used an OpenAI GPT model to generate the synthetic instruction data on which Alpaca was fine-tuned.
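A simplified sketch of that idea (a self-instruct-style loop, not Alpaca's exact pipeline): show a model a few hand-written Spanish seed tasks, ask it for a new one as JSON, and keep only well-formed answers. The `generate` callable stands in for whatever LLM is used, and the seed tasks and the stub generator below are made up for illustration.

```python
import json
import random
from typing import Callable, Dict, List

REQUIRED_KEYS = {"instruction", "input", "output"}

# Hypothetical hand-written seed tasks.
SEED_TASKS = [
    {"instruction": "Resume la nota clínica en una oración.",
     "input": "Paciente de 54 años con dolor torácico de 2 horas de evolución.",
     "output": "Adulto de 54 años con dolor torácico agudo."},
    {"instruction": "Enumera los medicamentos mencionados.",
     "input": "Se indica enalapril 10 mg y aspirina 100 mg.",
     "output": "Enalapril, aspirina."},
]


def build_prompt(seeds: List[Dict[str, str]]) -> str:
    """Format seed tasks as JSON examples followed by a request for a new one."""
    examples = "\n".join(json.dumps(s, ensure_ascii=False) for s in seeds)
    return ("Estos son ejemplos de tareas clínicas en formato JSON:\n"
            f"{examples}\n"
            "Escribe una tarea nueva y distinta, con el mismo formato JSON.")


def synthesize(generate: Callable[[str], str], n: int,
               seed: int = 0) -> List[Dict[str, str]]:
    """Ask the model for `n` new tasks and keep only valid JSON records."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        seeds = rng.sample(SEED_TASKS, k=min(2, len(SEED_TASKS)))
        raw = generate(build_prompt(seeds))
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # discard malformed generations
        if isinstance(record, dict) and REQUIRED_KEYS.issubset(record):
            records.append(record)
    return records


def stub_generate(prompt: str) -> str:
    """Placeholder for a real LLM call; always returns the same record."""
    return json.dumps({"instruction": "Clasifica la urgencia.",
                       "input": "Fiebre de 38 °C sin otros síntomas.",
                       "output": "Baja."}, ensure_ascii=False)


print(len(synthesize(stub_generate, n=3)))
```

A fuller version would also deduplicate near-identical generations before using them for fine-tuning.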
@leonardo.pabon your insights are super helpful, man.
Thanks a lot for taking the time to share them with me. I appreciate them very much.
It’s good to be helpful. I am Brazilian and spent some time thinking about LLM performance in Portuguese.
I had the chance to share this problem with an Indian data scientist who works in NLP for my company. We are in different areas, but our CEO heard I’m studying AI and allowed me to talk to him when I get questions.
He recommended using machine translation to translate the PubMed dataset to Spanish, and then using the translated dataset to fine-tune a model that already performs well in basic Spanish. That would give us a base model that can produce medical text, and even if the performance isn’t great, it gives us something to work with.
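That suggestion could be sketched roughly as follows: run each English abstract through a machine translation function and write the result as one JSON line per sample for fine-tuning. The `translate` callable, the record fields (`id`, `abstract`, `text`), and the sample data are assumptions for illustration; loading PubMed itself is left out.

```python
import io
import json
from typing import Callable, Dict, Iterable, TextIO


def write_translated_jsonl(records: Iterable[Dict[str, str]],
                           translate: Callable[[str], str],
                           out: TextIO) -> int:
    """Translate each record's abstract and write it as a JSON line.

    Returns the number of records written. Empty abstracts and samples whose
    translation raises are skipped rather than aborting the whole run.
    """
    written = 0
    for record in records:
        abstract = record.get("abstract", "").strip()
        if not abstract:
            continue
        try:
            text_es = translate(abstract)
        except Exception:
            continue
        line = json.dumps({"id": record.get("id"), "text": text_es},
                          ensure_ascii=False)
        out.write(line + "\n")
        written += 1
    return written


# Usage with a stand-in translator (a real run would call an MT engine here).
sample = [{"id": "1", "abstract": "Fever and cough for three days."},
          {"id": "2", "abstract": "   "}]
buffer = io.StringIO()
count = write_translated_jsonl(sample, lambda text: "[es] " + text, buffer)
print(count, buffer.getvalue(), end="")
```

Keeping the translator as a parameter makes it easy to swap engines and compare their output on the same samples.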
A team member is concerned that each PubMed sample is usually too specific and narrow, and we are aiming for something more general. I don’t share that concern, but I’d appreciate other points of view from our community.
I’ve found that there are some publications in Spanish (like Revista de medicina e investigación, by Elsevier) that used to publish articles in open access.
A good thing is that they used to publish the abstracts in both English and Spanish, so I guess that may be a nice source. Example: here.
I’d be happy to contribute by trying to make a list of similar “scrapable” sources, if that would be useful.
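Abstracts published in both languages could also serve as a small check on a translation engine: translate the English abstract and compare it with the human-written Spanish one. The sketch below uses a crude token-overlap F1 of my own choosing as the score; established metrics such as BLEU or chrF would be the more standard choice, and the example sentences are invented.

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between a machine translation and a reference text."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both texts.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


machine = "el paciente tiene fiebre"
human = "el paciente presenta fiebre"
print(token_f1(machine, human))
```

Low scores on such pairs would point to abstracts worth reading by hand to see what the engine got wrong.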
Here’s another source (this time a dataset) that may be useful: Biomedical Spanish CBOW Word Embeddings in Floret. According to the authors:
The embeddings were trained on the concatenation of all corpora from the Spanish biomedical corpus that includes Spanish data from various sources for a total of 1.1B tokens across 2,5M documents.
Hi @carloscapote !
Thanks a lot for sharing those resources.
I’d love to get some help on this. I’m pretty new to AI and Data Science, but I really want to learn, so having someone to talk to about these problems would be a great help.
If you don’t mind, I’ll send you a DM so we can talk more about this.
Feel free to DM me, but be warned that I’m also just getting started in AI. I have a couple of years of experience with computers, but I took my first AI course only a few months ago and I haven’t used it in any real-world application yet.
Anyway, the project seems very interesting. I’d be happy to give it a try, even if it were only for learning purposes.