As you can see, the AWS service and the open-source model will work in English.
After searching, I found that the medical LLM space in Spanish (and possibly other languages) is terribly underserved. Is that so?
Fine-tuning an open-source LLM to produce medical text in Spanish seems daunting. I wouldn’t know where to start.
I am not sure about the dataset availability or where to get the resources to perform the training. This is way beyond my league.
Could you help me validate my assumption here? Is it really the case that there aren’t medical LLMs in Spanish?
How would you tackle such a problem?
This concept seems problematic, because there are two stages where inaccuracy can occur.
I believe (not my field though) that most medical journals are in English.
So first you will have to use English-to-Spanish translation, which will introduce some probability of errors or loss of information.
Then you will need some AI to generate whatever Spanish-language reports you need, likely via one of the LLM tools. As we know, chat tools will insert their own very plausible-sounding errors, which may be difficult to detect.
I agree; if we translate the original input to English and the model’s output back to Spanish, we will get many things “lost in translation.”
One fact is that most scientific research journals are in English, so it makes sense that there aren’t medical LLMs in Spanish. It shows how a cultural bias that exists outside of AI simply gets propagated into it.
As a Spaniard, I can assure you that this problem is really common in a lot of areas, such as finance or NL2SQL.
There are a lot of good LLMs able to generate good SQL, but you need to prompt the model in English.
If the problem is only in the input text, the best solution is to translate it. I think that’s better than trying to train a model only on Spanish texts, or translating the English texts to Spanish beforehand.
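For the input side, here is a rough sketch of what I mean. It assumes the Hugging Face transformers library and the public Helsinki-NLP/opus-mt-es-en checkpoint, but any decent MT system would work:

```python
# Minimal sketch: translate the Spanish input to English before sending it
# to an English-only model. Assumes `pip install transformers sentencepiece`.
from transformers import pipeline

# Spanish -> English machine translation pipeline (public MarianMT checkpoint)
es_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")

def to_english(spanish_text: str) -> str:
    """Translate a Spanish clinical note so an English-only LLM can process it."""
    result = es_to_en(spanish_text, max_length=512)
    return result[0]["translation_text"]

if __name__ == "__main__":
    note = "El paciente presenta fiebre y tos persistente desde hace tres días."
    print(to_english(note))
```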
The problem with translations in our use case (which, for context, is working with clinical notes taken by doctors) is that doctors use specific terminology and lingo that is not used by the general public. So certain expressions can have a particular meaning that a simple translation engine might not easily translate.
In any case, we are validating that assumption first.
Looks like the main problem is the lack of Spanish medical content. So, it would be best if you started there. How can you acquire a good amount of this kind of content? Medical magazines, journals, etc? Maybe also consider just content in Spanish but not necessarily medical.
Also, maybe you could consider enrolling in the course Finetuning Large Language Models from Deeplearning.ai.
To use synthetic data, you would need an LLM that generates good content in Spanish. Maybe some existing model is good enough at that. Otherwise, you would need to fine-tune a model for it. I read about some people fine-tuning Llama on Portuguese data with supposedly improved results.
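Just to make the synthetic-data idea concrete, here is a tiny sketch. I use the small public bigscience/bloom-560m checkpoint only so the example runs; in practice you would need a much stronger Spanish-capable model, so treat the model choice and the prompt as placeholders:

```python
# Sketch: prompt an existing multilingual model to write short clinical-style
# paragraphs in Spanish and collect them as candidate synthetic training data.
from transformers import pipeline

# Small multilingual model used only as a stand-in; quality will be limited.
generator = pipeline("text-generation", model="bigscience/bloom-560m")

PROMPT = "Escribe un breve resumen clínico en español sobre: {topic}\n"

def generate_samples(topics, max_new_tokens=200):
    """Generate one synthetic Spanish medical paragraph per topic."""
    samples = []
    for topic in topics:
        out = generator(PROMPT.format(topic=topic), max_new_tokens=max_new_tokens)
        samples.append(out[0]["generated_text"])
    return samples

if __name__ == "__main__":
    print(generate_samples(["hipertensión arterial", "diabetes tipo 2"])[0])
```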
I had the chance to share this problem with an Indian data scientist who works in NLP for my company. We are in different areas, but our CEO heard I’m studying AI and allowed me to talk to him when I get questions.
He recommended using machine translation to translate the PubMed dataset to Spanish, and then using the translated dataset to fine-tune a model that already performs well in basic Spanish. That way we get a base model that can produce medical text, and even if its performance isn’t great, it already gives us something to work with.
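To make that recommendation a bit more concrete, here is a rough sketch of the translation step. The JSONL paths and field names are just placeholders for however we end up exporting the PubMed abstracts, and it assumes the public Helsinki-NLP/opus-mt-en-es checkpoint for the English-to-Spanish machine translation:

```python
# Sketch: machine-translate English PubMed abstracts into Spanish to build a
# fine-tuning corpus. Input format is assumed to be one {"abstract": "..."}
# JSON object per line; very long abstracts may need to be chunked first.
import json
from transformers import pipeline

en_to_es = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")

def translate_abstracts(in_path: str, out_path: str, max_length: int = 512) -> None:
    """Translate each English abstract and write a Spanish text corpus."""
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            abstract = json.loads(line)["abstract"]
            spanish = en_to_es(abstract, max_length=max_length)[0]["translation_text"]
            fout.write(json.dumps({"text": spanish}, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    # Hypothetical file names; replace with wherever the abstracts are exported.
    translate_abstracts("pubmed_abstracts.jsonl", "pubmed_abstracts_es.jsonl")
```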
A team member is concerned that each PubMed sample is usually too specific and narrow, and we are aiming for something more general. I don’t share that concern, but I’d appreciate other points of view from our community.
I’ve found that there are some publications in Spanish (like Revista de medicina e investigación, by Elsevier) that used to publish articles in open access.
A good thing is that they used to publish the abstracts in both English and Spanish, so I guess that may be a nice source. Example: here.
I’d be happy to contribute by trying to make a list of similar “scrapable” sources, if that would be useful.
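Just to show what I have in mind, here is a very rough scraping sketch. The URL list and the CSS selectors are placeholders, since each journal’s pages would need their own:

```python
# Sketch: collect English/Spanish abstract pairs from article pages into a CSV.
# Selectors and URLs below are placeholders for illustration only.
import csv
import requests
from bs4 import BeautifulSoup

ARTICLE_URLS = [
    # "https://example.com/revista/articulo-1",  # placeholder URL
]

def extract_abstract_pair(url):
    """Fetch one article page and pull out its English and Spanish abstracts."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    en_node = soup.select_one(".abstract-en")  # placeholder selector
    es_node = soup.select_one(".abstract-es")  # placeholder selector
    if en_node is None or es_node is None:
        return None
    return {
        "url": url,
        "abstract_en": en_node.get_text(strip=True),
        "abstract_es": es_node.get_text(strip=True),
    }

def build_corpus(out_path="bilingual_abstracts.csv"):
    rows = [r for r in (extract_abstract_pair(u) for u in ARTICLE_URLS) if r]
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "abstract_en", "abstract_es"])
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    build_corpus()
```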
I also came across pre-trained Spanish biomedical word embeddings: they were trained on the concatenation of all corpora from the Spanish biomedical corpus, which includes Spanish data from various sources, for a total of 1.1B tokens across 2.5M documents.
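If it helps, here is a tiny sketch of loading and sanity-checking embeddings like those with gensim, assuming they are distributed in the standard word2vec format (the file name is a placeholder; check the actual download for the real name and whether it is binary):

```python
# Sketch: load pre-trained Spanish biomedical word vectors and inspect them.
from gensim.models import KeyedVectors

# Placeholder file name; adjust `binary` to match the distributed format.
vectors = KeyedVectors.load_word2vec_format(
    "spanish_biomedical_embeddings.bin", binary=True
)

# Sanity check: nearest neighbours of a clinical term in the embedding space.
print(vectors.most_similar("diabetes", topn=5))
```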
Thanks a lot for sharing those resources.
I’d love to get some help on this. I’m pretty new to AI and Data Science, but I really want to learn, so having someone to talk about these problems will be of great help.
If you don’t mind, I’ll send you a DM so we can talk more about this.
Feel free to DM me, but be warned that I’m also just getting started in AI. I have a couple of years of experience with computers, but I took my first AI course only a few months ago and I haven’t worked with it in any real-world application yet.
Anyway, the project seems very interesting. I’d be happy to give it a try, even if only for learning purposes.
Hello Fabricio.
Just found this email thread and I’m wondering: what is the status of this development?
I am also new to the field, but I would like to help if there is a chance.
Thank you.
Jhon.
Hey John, thanks for chiming in.
To be completely honest, I’ve been focusing on diving deeper into NLP and MLOps to build the foundational knowledge I think will be required to create this model (I recently finished the MLOps Specialization, and currently I am going through the first course of the NLP Specialization).
I think we should start an email group or something like that.