As you can see, the AWS service and the open-source model will work in English.
After searching, I found that the medical LLM space in Spanish (and possibly other languages) is terribly underserved. Is that so?
Fine-tuning an open-source LLM to produce medical text in Spanish seems daunting. I wouldn’t know where to start.
I am not sure about the dataset availability or where to get the resources to perform the training. This is way beyond my league.
Could you help me validate my assumption here? Is it really the case that there aren’t medical LLMs in Spanish?
How would you tackle such a problem?
This concept seems problematic, because there are two stages where inaccuracy can occur.
I believe (not my field though) that most medical journals are in English.
So first you will have to use English-to-Spanish translation, which will introduce some probability of errors or loss of information.
Then you will need some AI to generate whatever Spanish-language reports you need, likely via one of the LLM tools. As we know, chat tools will insert their own very plausible-sounding errors, which may be difficult to detect.
I agree; if we translate the original input to English and the model’s output back to Spanish, we will get many things “lost in translation.”
One fact is that most scientific research journals are in English, so it makes sense that there aren’t medical LLMs in Spanish. It shows how a cultural bias that exists outside of AI simply gets propagated into it.
As a Spaniard, I can assure you that this problem is really common in a lot of areas, such as finance or NL2SQL.
There are a lot of good LLMs able to generate good SQL, but you need to prompt the model in English.
If the problem is only in the input text, the best solution is to translate it. I think that’s better than trying to train a model only on Spanish texts, or translating the English texts to Spanish beforehand.
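For the input side, here is a rough sketch of what I mean. It assumes the Hugging Face transformers library and the public Helsinki-NLP/opus-mt-es-en checkpoint, but any decent MT system would work:

```python
# Minimal sketch: translate the Spanish input to English before sending it
# to an English-only model. Assumes `pip install transformers sentencepiece`.
from transformers import pipeline

# Spanish -> English machine translation pipeline (public MarianMT checkpoint)
es_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")

def to_english(spanish_text: str) -> str:
    """Translate a Spanish clinical note so an English-only LLM can process it."""
    result = es_to_en(spanish_text, max_length=512)
    return result[0]["translation_text"]

if __name__ == "__main__":
    note = "El paciente presenta fiebre y tos persistente desde hace tres días."
    print(to_english(note))
```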
The problem with translations in our use case (which, for context, is working with clinical notes taken by doctors) is that doctors use specific terminology and lingo that is not used by the general public. So certain expressions can have a particular meaning that a simple translation engine might not easily translate.
In any case, we are validating that assumption first.
Looks like the main problem is the lack of Spanish medical content. So, it would be best if you started there. How can you acquire a good amount of this kind of content? Medical magazines, journals, etc? Maybe also consider just content in Spanish but not necessarily medical.
Also, maybe you could consider enrolling in the course Finetuning Large Language Models from Deeplearning.ai.
To use synthetic data, you would need an LLM that generates good content in Spanish. Maybe some existing model is good enough at that. Otherwise, you would need to fine-tune a model for it. I read about some people fine-tuning Llama on Portuguese data with supposedly improved results.
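Just to make the synthetic-data idea concrete, here is a tiny sketch. I use the small public bigscience/bloom-560m checkpoint only so the example runs; in practice you would need a much stronger Spanish-capable model, so treat the model choice and the prompt as placeholders:

```python
# Sketch: prompt an existing multilingual model to write short clinical-style
# paragraphs in Spanish and collect them as candidate synthetic training data.
from transformers import pipeline

# Small multilingual model used only as a stand-in; quality will be limited.
generator = pipeline("text-generation", model="bigscience/bloom-560m")

PROMPT = "Escribe un breve resumen clínico en español sobre: {topic}\n"

def generate_samples(topics, max_new_tokens=200):
    """Generate one synthetic Spanish medical paragraph per topic."""
    samples = []
    for topic in topics:
        out = generator(PROMPT.format(topic=topic), max_new_tokens=max_new_tokens)
        samples.append(out[0]["generated_text"])
    return samples

if __name__ == "__main__":
    print(generate_samples(["hipertensión arterial", "diabetes tipo 2"])[0])
```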
I had the chance to share this problem with an Indian data scientist who works in NLP for my company. We are in different areas, but our CEO heard I’m studying AI and allowed me to talk to him when I get questions.
He recommended using machine translation to translate the PubMed dataset to Spanish, and then using the translated dataset to fine-tune a model that already performs well in basic Spanish. That way we get a base model that can produce medical text, and even if its performance isn’t great, it already gives us something to work with.
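To make that recommendation a bit more concrete, here is a rough sketch of the translation step. The JSONL paths and field names are just placeholders for however we end up exporting the PubMed abstracts, and it assumes the public Helsinki-NLP/opus-mt-en-es checkpoint for the English-to-Spanish machine translation:

```python
# Sketch: machine-translate English PubMed abstracts into Spanish to build a
# fine-tuning corpus. Input format is assumed to be one {"abstract": "..."}
# JSON object per line; very long abstracts may need to be chunked first.
import json
from transformers import pipeline

en_to_es = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")

def translate_abstracts(in_path: str, out_path: str, max_length: int = 512) -> None:
    """Translate each English abstract and write a Spanish text corpus."""
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            abstract = json.loads(line)["abstract"]
            spanish = en_to_es(abstract, max_length=max_length)[0]["translation_text"]
            fout.write(json.dumps({"text": spanish}, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    # Hypothetical file names; replace with wherever the abstracts are exported.
    translate_abstracts("pubmed_abstracts.jsonl", "pubmed_abstracts_es.jsonl")
```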
A team member is concerned that each PubMed sample is usually too specific and narrow, and we are aiming for something more general. I don’t share that concern, but I’d appreciate other points of view from our community.
I’ve found that there are some publications in Spanish (like Revista de medicina e investigación, by Elsevier) that used to publish articles in open access.
A good thing is that they used to publish the abstracts in both English and Spanish, so I guess that may be a nice source. Example: here.
I’d be happy to contribute by trying to make a list of similar “scrapable” sources, if that would be useful.
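Just to show what I have in mind, here is a very rough scraping sketch. The URL list and the CSS selectors are placeholders, since each journal’s pages would need their own:

```python
# Sketch: collect English/Spanish abstract pairs from article pages into a CSV.
# Selectors and URLs below are placeholders for illustration only.
import csv
import requests
from bs4 import BeautifulSoup

ARTICLE_URLS = [
    # "https://example.com/revista/articulo-1",  # placeholder URL
]

def extract_abstract_pair(url):
    """Fetch one article page and pull out its English and Spanish abstracts."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    en_node = soup.select_one(".abstract-en")  # placeholder selector
    es_node = soup.select_one(".abstract-es")  # placeholder selector
    if en_node is None or es_node is None:
        return None
    return {
        "url": url,
        "abstract_en": en_node.get_text(strip=True),
        "abstract_es": es_node.get_text(strip=True),
    }

def build_corpus(out_path="bilingual_abstracts.csv"):
    rows = [r for r in (extract_abstract_pair(u) for u in ARTICLE_URLS) if r]
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "abstract_en", "abstract_es"])
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    build_corpus()
```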
I also came across pre-trained Spanish biomedical word embeddings: they were trained on the concatenation of all corpora from the Spanish biomedical corpus, which includes Spanish data from various sources, for a total of 1.1B tokens across 2.5M documents.
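If it helps, here is a tiny sketch of loading and sanity-checking embeddings like those with gensim, assuming they are distributed in the standard word2vec format (the file name is a placeholder; check the actual download for the real name and whether it is binary):

```python
# Sketch: load pre-trained Spanish biomedical word vectors and inspect them.
from gensim.models import KeyedVectors

# Placeholder file name; adjust `binary` to match the distributed format.
vectors = KeyedVectors.load_word2vec_format(
    "spanish_biomedical_embeddings.bin", binary=True
)

# Sanity check: nearest neighbours of a clinical term in the embedding space.
print(vectors.most_similar("diabetes", topn=5))
```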
Thanks a lot for sharing those resources.
I’d love to get some help on this. I’m pretty new to AI and Data Science, but I really want to learn, so having someone to talk about these problems will be of great help.
If you don’t mind, I’ll send you a DM so we can talk more about this.
Feel free to DM me, but be warned that I’m also just getting started in AI. I have a couple of years of experience with computers, but I took my first AI course only a few months ago and I haven’t worked with it in any real-world application yet.
Anyway, the project seems very interesting. I’d be happy to give it a try, even if only for learning purposes.
Hello Fabricio.
Just found this email thread and I’m wondering: what is the status of this development?
I am also new to the field, but I would like to help if there is a chance.
Thank you.
Jhon.
Hey John, thanks for chiming in.
To be completely honest, I’ve been focusing on diving deeper into NLP and MLOps to build the foundational knowledge I think will be required to create this model (I recently finished the MLOps Specialization, and currently I am going through the first course of the NLP Specialization).
I think we should start an email group or something like that.