What if there's no data in question-answer format?

archon · October 6, 2023, 12:51am

Can I use Lamini to finetune a model on my own documentation? There’s no data in question-answer format like the course always shows, so I can’t use the same technique, instruction finetuning. I just have a 500-line txt file with the docs (actually it’s not even documentation, it’s the rulebook of a board game).

balaji.ambresh · October 6, 2023, 4:49am

Why can’t you come up with question / answer pairs?

Here’s an example for monopoly:
Question: What should I do when I land on a chance square?
Answer: Pick a card from the chance pile of cards and follow the instructions.
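
As a rough sketch, hand-written pairs like the one above could be stored one JSON object per line (JSONL). The `question` / `answer` key names and the `to_jsonl` helper are assumptions for illustration; check what your finetuning tool actually expects.

```python
import json

# One dict per rule; this single pair is the Monopoly example from above.
pairs = [
    {
        "question": "What should I do when I land on a chance square?",
        "answer": "Pick a card from the chance pile of cards and follow the instructions.",
    },
]

def to_jsonl(pairs):
    """Serialize the pairs as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(pair) for pair in pairs)

print(to_jsonl(pairs))
```

Going through the rulebook section by section and adding a few pairs per section is one way to build up coverage gradually.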

archon · October 7, 2023, 12:16am

I don’t see how that could be done in a way that completely covers the entire rulebook.

I tried to put the rulebook in a prompt for the OpenAI GPT API, but it’s too big for their gpt-3.5-turbo-16k model (my prompt is around 29500 tokens). gpt-4-32k could handle that prompt, but it’s not publicly available at the moment.

balaji.ambresh · October 7, 2023, 4:54am

If the rules are in sections that are unrelated to each other, it’s possible to get good results by doing the following:

Generate text embeddings for each section.
When a query is received, find the rule section that’s closest in terms of embeddings.
Submit another query to OpenAI that contains just that closest section together with your original query.
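
A minimal sketch of those three steps, with no framework involved. Note the assumptions: `embed` here is a toy bag-of-words stand-in so the example runs on its own (in practice you would call a real embedding model there), the `Jail` section text is made up for illustration, and `build_prompt` only assembles the text you would then send to the model.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def closest_section(query, sections):
    """Return the title of the section whose embedding is nearest to the query."""
    q = embed(query)
    return max(sections, key=lambda title: cosine(q, embed(sections[title])))

def build_prompt(query, sections):
    """Build the second query: only the best-matching section plus the question."""
    title = closest_section(query, sections)
    return (
        "Answer using only this rule section.\n\n"
        f"{title}:\n{sections[title]}\n\n"
        f"Question: {query}"
    )

# Hypothetical rulebook sections, keyed by section title.
sections = {
    "Chance": "When you land on a chance square, pick a card from the chance pile and follow the instructions.",
    "Jail": "A player in jail may roll doubles or pay a fine to get out.",
}

print(build_prompt("What should I do when I land on a chance square?", sections))
```

With real embeddings you would compute the section vectors once and cache them, so each incoming query costs only one embedding call plus the similarity comparison.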

Another way:

Summarize the rule book and create a smaller document. Most of the rules should be in sections. So, summarizing several sections of the rule book in 1 shot is a good place to start.
Query this smaller document.
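
A rough sketch of the summarizing approach, under some assumptions: word count is used as a crude proxy for tokens, and `summarize` is a hypothetical callable that you would back with an LLM call. Here it is passed in, so the batching logic can be tried on its own.

```python
def batch_sections(sections, max_words):
    """Greedily pack consecutive sections into batches of at most max_words words.

    A single section longer than max_words ends up in a batch of its own.
    """
    batches, current, count = [], [], 0
    for text in sections:
        n = len(text.split())
        if current and count + n > max_words:
            batches.append(current)
            current, count = [], 0
        current.append(text)
        count += n
    if current:
        batches.append(current)
    return batches

def condense(sections, summarize, max_words=1000):
    """Summarize each batch in one shot and join the summaries into a smaller document."""
    summaries = [
        summarize("\n\n".join(batch))
        for batch in batch_sections(sections, max_words)
    ]
    return "\n\n".join(summaries)
```

For example, `condense(rule_sections, summarize=my_llm_summary)` (both names hypothetical) would return the smaller document to query, provided the result fits in the model's context window.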

If none of these are helpful, please see this post on the OpenAI forum and consider posting your question there.

devang_pagare · October 19, 2023, 2:59am

For your first approach, are there any techniques other than using LangChain to pass the query output of one model to another model?

Also, how would you identify the section to which the query belongs?
