How to fine-tuning from a stack of PDFs which are not in Q&A format?


My goal is to fine-tune LLMs on a stack of PDFs to learn domain specific knowledge (for example, private medical knowledge database that describe additional information for each diseases listed). These stack of PDFs are not formatted as Q&A pairs, and they contain a mixture of different types of data, for example textual information, tables, even references.

So how to fine-tune the LLMs given such data? I don’t think it’s easy to create Q&A pairs out of these PDFs. My goal is to have LLMs learn these domain specific information and then answer medical related questions.

Also can you please explain the differences in fine-tuning GPT-3.5 / GPT-4 vs other open source LLMs that can be localized from HuggingFace directly? It seems like the course is only teaching fine-tuning models from HuggingFace but not GPT3.5/4 models. I want to also learn how to fine-tune GPT-3.5/4 and at the same time not have my data/fine-tuned model available to public or even OpenAI. Is there a way?

Thanks for any feedback here!!!


@MrHuanwang I understand that using LLM APIs means exposing your user-sensitive data to external parties like OpenAI. Unless you are using an enterprise version of these LLM APIs, it is difficult to protect your data from being misused as you might not be covered with appropriate data protection contracts. (Unless you want to host the LLM on your local and fine-tune it, which IMO is not an easy task.).

Here are the few options that I can think of for you:

  1. Use Vector Databases and Embeddings API (HF or OpenAI). Convert all your pdfs to vector database and use Retrieval techniques (topK, MMR etc.) to fetch answers to queries.
  2. If compute resource is not an issue, try smaller LLMs like Mistral 7B or 13B (from HF) rather than using LLM APIs.
  3. Buy an enterprise version of LLM. It usually comes with QA as a task capability where you can upload PDFs and customize how LLMs splits and reads these PDFs.

However, for a better performance I would suggest converting the PDFs into a simpler text format (.txt , .csv) and removing unwanted objects like images or links. Also look into how to split documents (PDFs) in an efficient way.

Hi, @jyadav202

Thanks for the quick response. All your suggestions make sense and I have ideas of how to go about each. But it seems none of the suggestions are suitable for approaches learned from this course, which is really fine-tuning open source models (e.g. LLMs from HF) using Q&A datasets. Am I right to assume this?

Another question, is it even possible to fine-tune chat LLMs (either from HF or OpenAI) using description texts (not formatted as Q&A pairs) to get fine-tuned LLM model for chat generation task? Basically have LLM to learn specific domain knowledge (non Q&A knowledges) and perform chat task? My guess is probably not, because chat models require any further fine-tuning dataset to be in Q&A formats as it has to align with the model’s intended task?



The suggestions I presented were keeping in mind this course as well. Your use-case is more about chatting with your data and for that you can look into the other short-course “LangChain: Chat with Your Data”.
It is very much possible to fine-tune LLMs with your database, but it has a slight downside in your case. That is you like to query your private medical knowledge base (KB) and most likely you would want the model to return accurate answers (actually high Recall) from this KB. Fine-tuning a model would result in its weight updates, so the answers it will generate will be from its current learnings and old knowledge. To avoid this mixture of knowledge, you can instead use Vector DB to query from your KB and then use system prompts to generate answers by your LLMs.


I actually tried similar approaches to Vector DB by following the examples from course " Building and Evaluating Advanced RAG Applications", which I believe uses even more advanced RAG techniques than the basic RAG technique used in Vector DB. However, the result isn’t great. Thus I want to try fine-tune instead.

I understand that fine-tuned model from private medical KB would cause updated weights containing information from both current learnings and old knowledge. If I’m ok with answers generated via both new and old knowledge, how exactly do I fine-tune a chat LLM with private medical KB (again not formatted as Q&A)? Based on what’s shown in this course, it seems like it requires Q&A formatted dataset for fine-tuning chat LLMs. Again is it possible to do this for GPT-3.5/4 if I do have the enterprise version? I was reading OpenAI document, and it seems like it also requires Q&A datasets for fine-tuning.

In my knowledge, the LLM woud need to update weights as per the objective function and calculate loss. If we are performing instruction fine-tuning, as in the course, it would need a QA pair to which it would first generate answers and then calculate loss with actual answer.
To get this QA pair you would need LLM to do it for you. For that you anyway need to perform RAG, so that LLM can then generate these pairs. The quality of these pairs would determine how well your LLM will fine-tune.

I recommend you ask your question on vendor-specific forums if you are using their APIs on how to perform fine-tuning in supervised and unsupervised fashion.

Hello Huan, I’ve read your comment about fine-tuning LLMs on PDFs to learn domain-specific knowledge. Have you successfully completed these tasks you described, and if so, could you share some details about your methodology and code? I would like to do the same in a different domain, please.