What if we train a model on plain text, with no instructions?

I want to train a model on medical vocabulary. The end result should be a task-based model only, but should I first train on a lot of articles and book texts that contain medical terminology, and then add instruction-based data?

Hi @anand_trivedi

Welcome to the community.

Working in the medical domain requires ethical care and a commitment to compliance.

Keep in mind the ethical considerations and data privacy when working with medical data. Ensure compliance with relevant regulations and guidelines to protect sensitive patient information.

With that in mind, I found a very good article that would be a good starting point for answering your question.

Also, we have a very good course related to this domain available on Coursera.

I hope this helps you on your journey.

Best regards
elirod

Your training should be divided into two phases: pre-training and fine-tuning.

Phase 1: Pre-training
The base LLM is trained on the large corpus of medical text you have collected (articles, books, etc.). No instructions are needed at this stage; the model learns from the raw text itself. The type of model you choose for this phase depends on the final use case. You say it's a "task-based" model, but what kind of tasks do you want to use it for? Here are the available model types and the tasks they are good at:

Autoencoding models (encoder only)

  • Sentiment analysis
  • Named entity recognition
  • Word classification

Autoregressive models (decoder only)

  • Text generation
  • Code generation

Sequence-to-sequence models (encoder and decoder)

  • Language Translation
  • Text summarization
  • Question answering

Let's say you want to develop a question-answering application that generates answers to medical questions. In that case, you should choose a sequence-to-sequence model, which contains both an encoder and a decoder.
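To make "training on plain text with no instructions" concrete, here is a minimal sketch of how a raw sentence can be turned into an (input, target) pair for an encoder-decoder model. It is a simplified, single-span version of T5-style span corruption. The function name, the word-level splitting, and the example sentence are assumptions for illustration only; real pre-training pipelines mask several random spans at the token level.

```python
def make_denoising_example(text, span_start, span_len, sentinel="<extra_id_0>"):
    """Mask one span of words and return (model_input, target).

    Hypothetical helper: a single-span simplification of span corruption.
    The sentinel string follows the T5 naming convention.
    """
    words = text.split()
    if span_len < 1 or not 0 <= span_start < len(words):
        raise ValueError("span must start inside the text and cover at least one word")
    span_end = min(span_start + span_len, len(words))
    # The encoder sees the sentence with the span replaced by a sentinel.
    model_input = words[:span_start] + [sentinel] + words[span_end:]
    # The decoder learns to produce the sentinel followed by the hidden words.
    target = [sentinel] + words[span_start:span_end]
    return " ".join(model_input), " ".join(target)


# Made-up sentence standing in for a line from a medical article.
sentence = "The patient reported mild chest pain after exercise"
model_input, target = make_denoising_example(sentence, span_start=3, span_len=3)
print(model_input)  # The patient reported <extra_id_0> after exercise
print(target)       # <extra_id_0> mild chest pain
```

No human-written labels are involved: the text supplies both the input and the target, which is why articles and books can be used directly in this phase.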

Phase 2: Fine-tuning

At this point, the pre-trained model has a good grasp of your medical vocabulary and a native ability for sequence-to-sequence generation. The analogy is a student who has had fundamental training in physics. Fine-tuning is like training this student to do research in physics: the student builds on that base training and applies techniques and methodologies in the context of research.

In fine-tuning, you create prompt templates for a particular task (say, text summarization). For example:

Training Input:
Summarize the following medical conversation
{Input prompt}
Summary:

Training label:
Summary:
{Human baseline summary}
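
The template above could be applied to a dataset with a small helper like the one below. This is only a sketch: the names PROMPT_TEMPLATE and build_example, the dictionary keys, and the sample dialogue are assumptions, not part of any library. It also assumes the label holds just the human summary text, since the input already ends with "Summary:".

```python
PROMPT_TEMPLATE = (
    "Summarize the following medical conversation\n"
    "{conversation}\n"
    "Summary:"
)


def build_example(conversation, human_summary):
    """Return one fine-tuning record: the filled-in prompt and its label."""
    return {
        # The instruction and the "Summary:" cue wrap the raw conversation.
        "input": PROMPT_TEMPLATE.format(conversation=conversation.strip()),
        # Assumption: the label is the human baseline summary on its own.
        "label": human_summary.strip(),
    }


# Made-up dialogue and summary, for illustration only.
example = build_example(
    "Doctor: How long have you had the cough?\nPatient: About two weeks.",
    "The patient has had a cough for about two weeks.",
)
print(example["input"])
print(example["label"])
```

Each record pairs an instruction-style input with a human-written answer, which is what distinguishes this phase from pre-training on plain text.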