Medical concept extraction from documents through LLM fine-tuning

After completing the week 2 course, I would like to fine-tune an LLM to extract clinical concepts (NER) from medical documents written in free text. How can I prepare a dataset to fine-tune an openly available clinical model, such as ClinicalBERT, for this task? Any example of fine-tuning LLMs for an NER task would be helpful.

Hi @krishnareddy_Nandyal ,

First thought is: most probably it already exists.

Second thought: to do it, you will need to create (or find) a robust dataset (thousands of samples) in which each sample has at least these two parts: “sentences from a medical document written in free text” and “the medical terms in that sentence”.

If you find or create such a dataset, you can fine-tune a model like DistilBERT or others and get good results (probably around 85% precision). But it will all depend on the quality and size of the dataset.
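
To make the two-part sample concrete: for a BERT-style token classifier, the sentence and its list of terms are usually turned into one tag per token (the BIO scheme). A minimal sketch, assuming a simple regex tokenizer and a single hypothetical `CONCEPT` label; the function name is illustrative, and a real pipeline would also have to align these tags with the model's subword tokens:

```python
import re

TOKEN_PATTERN = r"\w+|[^\w\s]"  # words, or single punctuation marks

def bio_tags(sentence, terms):
    """Return (tokens, tags) with B-/I-CONCEPT marking each annotated term."""
    tokens = re.findall(TOKEN_PATTERN, sentence)
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    for term in terms:
        term_tokens = [t.lower() for t in re.findall(TOKEN_PATTERN, term)]
        n = len(term_tokens)
        if n == 0:
            continue
        # Slide over the sentence and tag every untagged exact match of the term.
        for i in range(len(tokens) - n + 1):
            window_free = all(tag == "O" for tag in tags[i:i + n])
            if lowered[i:i + n] == term_tokens and window_free:
                tags[i] = "B-CONCEPT"
                for j in range(i + 1, i + n):
                    tags[j] = "I-CONCEPT"
    return tokens, tags

tokens, tags = bio_tags(
    "Patient admitted with abdominal pain, nausea.",
    ["abdominal pain", "nausea"],
)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Here "abdominal" gets `B-CONCEPT`, "pain" gets `I-CONCEPT`, "nausea" gets `B-CONCEPT`, and every other token gets `O`.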

As for the how: you can explore this fine-tuning using PEFT with LoRA, as shown in the course.
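
As a rough illustration of why LoRA keeps fine-tuning cheap: instead of updating a full d x k weight matrix, it trains two small matrices of shapes d x r and r x k. A back-of-the-envelope sketch; the 4096 x 4096 layer size and rank 8 are illustrative assumptions, not values taken from the course:

```python
def lora_param_counts(d, k, r):
    """Trainable parameters for a full d x k update vs. a rank-r LoRA update."""
    full = d * k          # every entry of the weight matrix is trainable
    lora = r * (d + k)    # one d x r matrix plus one r x k matrix
    return full, lora

full, lora = lora_param_counts(d=4096, k=4096, r=8)
print(f"full update: {full:,} parameters")
print(f"LoRA update: {lora:,} parameters ({lora / full:.2%} of full)")
```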

Hope this helps!

Let me know how it goes or if you have more questions.


Thank you @Juan_Olano for the suggestions, I will look into this

Hi @Juan_Olano ,

Once again, thank you for the previous response. I have recently been able to collect a dataset for clinical information extraction from medical free text. The following is one sample from my dataset.

[Input] Clinical text: “Patient admitted with abdominal pain, nausea, and denied any chills or fever. Patient also has history of appendectomy.”

[Output] Clinical concepts: I would like to see the output in the following JSON format, and I am able to prepare my entire training dataset this way:
{"concept": "abdominal pain", "history": "false", "sentiment": "positive"},
{"concept": "nausea", "history": "false", "sentiment": "positive"},
{"concept": "chills", "history": "false", "sentiment": "negative"},
{"concept": "fever", "history": "false", "sentiment": "negative"},
{"concept": "appendectomy", "history": "true", "sentiment": "positive"}

Now I would like to prepare these samples to fine-tune Llama 2, but I cannot work out how to convert this dataset into the format accepted by the Hugging Face TRL library. Can you please help me with this? I mean, I need these samples in a format that Llama 2 can accept as a training dataset.

As you said, there may already be models available for this, but I want to do it using Llama 2.

Krishna Reddy.

This is looking interesting!

Check out this dataset:

This dataset is in the format accepted by Llama2 for fine-tuning.

Just take the first record of the dataset and you’ll discover that the syntax goes along these lines:
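
A minimal sketch of that kind of record, assuming the single-turn Llama 2 template `<s>[INST] input [/INST] output </s>` quoted later in this thread. The function name, the instruction wording, and the `text` field name are illustrative assumptions, not taken from the linked dataset:

```python
import json

def to_llama2_text(clinical_text, concepts,
                   instruction="Extract the clinical concepts from the text as JSON."):
    """Pack one (clinical text, concept list) pair into a single training string."""
    prompt = f"{instruction}\nClinical text: {clinical_text}"
    answer = json.dumps(concepts)  # serialize the target so it is always valid JSON
    # Note: some tokenizers add the <s> token themselves; if yours does,
    # drop it here so it is not duplicated.
    return f"<s>[INST] {prompt} [/INST] {answer} </s>"

sample = to_llama2_text(
    "Patient admitted with abdominal pain. Patient also has history of appendectomy.",
    [
        {"concept": "abdominal pain", "history": "false", "sentiment": "positive"},
        {"concept": "appendectomy", "history": "true", "sentiment": "positive"},
    ],
)

# One record per sample, all in a single text column.
records = [{"text": sample}]
print(records[0]["text"])
```

Applying the same function to every sample keeps the format consistent across the whole training set.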


Try this and let me know how it goes!

Dear @Juan_Olano,

I want to express my gratitude for your valuable suggestions regarding dataset preparation. Following your guidance, I have successfully prepared the dataset. It is clear that for fine-tuning, adherence to a standard format is essential. While the format need not be specific, consistency throughout the training dataset samples is imperative.

I attempted fine-tuning using the llama2 7B model. Unfortunately, the results did not meet our expectations. I recall hearing that fine-tuning may not enable the acquisition of entirely new domain knowledge but rather focuses on altering the response format. Could you kindly confirm if this is accurate? Furthermore, do you recommend trying a larger model, such as the 13B or 70B, for potentially improved results?

Your insights on this matter would be greatly appreciated

Hi @krishnareddy_Nandyal !

I am glad that you were able to move in the right direction! I understand that there’s still a long way to go. This would be my advice:

  1. Fine-tuning. This will improve several aspects of the model that will definitely impact the quality of the output: format, style, and, to a certain extent, domain knowledge. But don’t rely on this last one: it is more that the model acquires the ‘flavors’ of the domain, and it will depend heavily on the quality of your training set. Testing with larger models may bring some benefits, but I would stick to the 7B until all options are tested here. What other options? See items 2 and 3 below.

  2. Prompt engineering. Now that you have a fine-tuned model, your prompts will behave better. Make sure that you provide the appropriate context on every call, and try different prompts to guide your model to the expected outputs.

  3. Information retrieval. This one is fundamental. With your data, create a vector database. This will be the perfect context for item 2 above (prompting).

This is the perfect trifecta. If/when you combine these 3 tools, you will get the best possible result from your LLM system.
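
Items 2 and 3 can be sketched together: retrieve the most relevant reference text, then place it in the prompt as context. This is a toy stand-in, assuming word-count vectors and cosine similarity in place of a real embedding model and vector database; the reference snippets and function names are made up for illustration:

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy "embedding": lowercase word counts. A real system would use
    # a learned embedding model here.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, top_k=1):
    """Return the top_k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, documents):
    """Prepend the retrieved context to the text the model should process."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nClinical text: {query}"

docs = [
    "Appendectomy is the surgical removal of the appendix.",
    "Nausea is an urge to vomit.",
]
print(build_prompt("Patient has history of appendectomy.", docs))
```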

Please ask any question and I’ll be happy to dig deeper.




Can you provide a link to the data so that others can use it?

Hi @Adeel_Hasan, this dataset currently contains sensitive patient information, so we cannot open-source it at the moment.

Hi @Juan_Olano

I have a question about training and preparing data for a model. Specifically, when we have our training and validation datasets organized as a single text column in a standard format, e.g. <s>[INST] <INPUT>[/INST] <OUTPUT></s>, how does the model understand and train on this single column as input? It seems like the model needs to know what the input is and what the corresponding output should be in order to compare results. Am I missing something here?

Please check the following notebook on training Llama 2 7B on a public Q&A dataset. In the script, they combine the question and the answer into one prompt and use that for training. But in our course we prepared separate input and label columns to measure quality with the ROUGE score. How do we calculate ROUGE scores for this notebook? My concern is how the model validates during training if the dataset contains both input and output in the same column.

Krishna Reddy

When you train an LLM, the input to training is text and a label. The text format usually depends on the specific model you are training. Llama 2, BERT, and FLAN-T5 each have a preferred way to receive their prompts for training. These can be reviewed in their respective documentation on Hugging Face. And as with any model that is trained, you have the label. In the case of a transformer being trained for a classification task, the label will be the class of the prompt (the ground truth). For a causal language model such as Llama 2, the label is the text itself: the model is trained to predict each next token of the same sequence, which is why a single text column is enough. In the training loop, you pass the training data (in batches) and the model responds with its inference. This output (the inference) is used to calculate the loss and adjust the weights. If you are using a Hugging Face library (the transformers library), all this is done for you behind the scenes.
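
For the single-column case, this can be sketched in plain Python. In Hugging Face causal language models the labels are typically a copy of the input token ids, and the one-position shift happens inside the model. Whether the prompt tokens are excluded from the loss is a choice made when the data is prepared, so the flag below shows both options. The token ids and function names are made up for illustration:

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default

def build_labels(input_ids, prompt_len, mask_prompt=True):
    """Labels are a copy of the inputs; optionally hide the prompt from the loss."""
    labels = list(input_ids)
    if mask_prompt:
        labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels

def next_token_pairs(input_ids, labels):
    """List (context, target) pairs: tokens up to position i predict the label at i + 1."""
    return [
        (input_ids[: i + 1], labels[i + 1])
        for i in range(len(input_ids) - 1)
        if labels[i + 1] != IGNORE_INDEX
    ]

# Pretend ids: 1 = <s>, 11..13 = prompt tokens, 21..22 = answer tokens, 2 = </s>
input_ids = [1, 11, 12, 13, 21, 22, 2]
labels = build_labels(input_ids, prompt_len=4)

for context, target in next_token_pairs(input_ids, labels):
    print(context, "->", target)
```

With the prompt masked, only the answer tokens and the closing token contribute to the loss, yet the model still sees the prompt as context for each prediction.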

You are in control of this. You load the dataset and you create the data loader. In the data loader you specify how the training data is formatted. Check out this section in the notebook of the lesson. You will see there how the data is prepared for the model.
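
On the ROUGE question: one option is to split each single-column example back into a prompt and a reference answer, generate from the prompt only, and score the generation against the reference. A minimal sketch, assuming the `<s>[INST] ... [/INST] ... </s>` layout described above; the simplified unigram score here is only a stand-in for a full ROUGE implementation, and the example strings are made up:

```python
import re
from collections import Counter

def split_example(text):
    """Split one training string into (prompt to generate from, reference answer)."""
    prompt_part, answer_part = text.split("[/INST]", 1)
    prompt = prompt_part + "[/INST]"
    reference = answer_part.replace("</s>", "").strip()
    return prompt, reference

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a prediction and a reference."""
    pred = Counter(re.findall(r"\w+", prediction.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((pred & ref).values())  # clipped count of shared words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

text = "<s>[INST] What is fever? [/INST] A raised body temperature </s>"
prompt, reference = split_example(text)

# In practice the prediction would come from the model, generated from `prompt`.
prediction = "raised temperature"
print(prompt)
print(round(rouge1_f1(prediction, reference), 3))
```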


Thank you @Juan_Olano. I get it now: the same text is used as both the input and the output.
For anyone who wants more clarification, please check the following link.

Hi Krishna, thanks for sharing the notebook; it really helped me. Could you please guide me on how to become proficient at fine-tuning models, with a good understanding of how things work, so that I can diagnose problems, tune the fine-tuning parameters, and prepare a dataset for a specific problem?

I sent you a direct message. Please check it; I hope it helps.