When you’re looking for answers from a large language model, some prompts are better than others. So how can you come up with the best one? A new method automates the process.
What’s new: Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, and colleagues at the University of Toronto, the Vector Institute, and the University of Waterloo developed Automatic Prompt Engineer (APE), a procedure that generates effective prompts for large language models.
Key insight: Given a handful of input-output pairs, a large language model can generate a prompt that, paired with the same inputs, yields similar outputs. Moreover, having produced a prompt, it can generate variations of that prompt that may yield outputs that match even more closely.
How it works: APE requires two large language models: a prompt generator (which produces prompts) and a content generator (which, given a prompt, produces output). For the prompt generator, they tried both language models that complete inputs (such as GPT-3 and InstructGPT) and those that fill in blanks in inputs (such as T5, GLM, and InsertGPT). For the content generator, they used InstructGPT.
- The authors fed the prompt generator a prompt such as, “I gave a friend an instruction and five inputs. The friend read the instruction and wrote an output for every one of the inputs. Here are the input-output pairs:” followed by a small set of example inputs and outputs drawn from Instruction Induction, such as the names of two animals and which one is larger. After the example inputs and outputs, the prompt concluded, “The instruction was ”. The prompt generator responded with a prompt such as “Choose the animal that is bigger.”
- They fed the generated prompt plus 50 example inputs from the dataset to the content generator, which generated outputs.
- They scored the prompt’s quality based on how often the content generator produced outputs that exactly matched the expected outputs.
- They refined the prompt by asking the prompt generator to produce a prompt similar to the highest-scoring one (“Generate a variation of the following instruction . . . ”) and repeated the process, performing this step three times. For example, a higher-scoring variation of the earlier example prompt is “Identify which animal is larger.” (The sketch below pulls these steps together in code.)
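The loop these steps describe fits in a few dozen lines. Below is a minimal sketch in Python, assuming a hypothetical `complete(prompt)` helper that wraps whatever language-model API you use (the paper uses separate prompt and content generators; a single helper stands in for both here). The demonstration data, prompt templates, and candidate counts are illustrative stand-ins rather than the authors’ exact setup.

```python
def complete(prompt: str) -> str:
    """Placeholder: send `prompt` to a large language model and return its completion."""
    raise NotImplementedError("plug in your preferred LLM API here")

# A handful of input-output demonstrations, in the style of Instruction Induction.
demos = [("cat, elephant", "elephant"), ("mouse, dog", "dog")]
# Held-out examples used to score candidate prompts by exact match.
eval_set = [("sparrow, whale", "whale"), ("ant, horse", "horse")]

def propose_prompts(demos, n=10):
    """Ask the prompt generator to guess the instruction behind the demonstrations."""
    shown = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in demos)
    meta_prompt = (
        "I gave a friend an instruction and some inputs. The friend read the "
        "instruction and wrote an output for every one of the inputs. "
        f"Here are the input-output pairs:\n{shown}\nThe instruction was "
    )
    return [complete(meta_prompt) for _ in range(n)]

def score(prompt, eval_set):
    """Fraction of held-out inputs whose generated output exactly matches the target."""
    hits = sum(
        complete(f"{prompt}\nInput: {x}\nOutput:").strip() == y for x, y in eval_set
    )
    return hits / len(eval_set)

def refine(best_prompt, n=10):
    """Ask the prompt generator for variations of the best prompt found so far."""
    meta_prompt = (
        "Generate a variation of the following instruction while keeping the "
        f"semantic meaning.\nInstruction: {best_prompt}\nVariation:"
    )
    return [complete(meta_prompt) for _ in range(n)]

def ape(demos, eval_set, rounds=3):
    """Propose prompts, keep the best by exact-match score, then refine it repeatedly."""
    best = max(propose_prompts(demos), key=lambda p: score(p, eval_set))
    for _ in range(rounds):
        best = max(refine(best) + [best], key=lambda p: score(p, eval_set))
    return best
```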
Results: Earlier work on automated prompt engineering used large language models to generate prompts but didn’t refine them iteratively. On 19 of the 24 tasks in Instruction Induction, prompts generated by InstructGPT using APE outperformed both the earlier work and human-engineered prompts according to interquartile mean (IQM), the mean exact-match accuracy after discarding the lowest and highest 25 percent of scores. Across all 24 tasks, prompts produced by InstructGPT using APE achieved 0.765 IQM, while human-written prompts achieved 0.749 IQM. By optimizing measures of truthfulness and informativeness, the method produced prompts that steered the content generator toward output with those qualities. For instance, on TruthfulQA, a question-answering dataset that tests for truthful and informative answers, answers produced by InstructGPT using APE were rated both true and informative 40 percent of the time, versus 30 percent for answers produced using human-written prompts (although the APE-prompted answers often took shortcuts such as “no comment,” which is truthful but uninformative).
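For readers unfamiliar with the metric, the interquartile mean can be computed in a few lines. The sketch below is a simple illustration; the scores are made-up values, not numbers from the paper.

```python
def interquartile_mean(scores):
    """Mean of the central 50 percent: drop the lowest and highest quarter of values.
    (Simple integer trim; a true IQM interpolates when len(scores) isn't divisible by 4.)"""
    s = sorted(scores)
    k = len(s) // 4           # how many values to drop from each end
    middle = s[k:len(s) - k]  # keep the central 50 percent
    return sum(middle) / len(middle)

# Made-up exact-match accuracies for eight hypothetical tasks:
print(interquartile_mean([0.20, 0.55, 0.70, 0.75, 0.80, 0.85, 0.95, 1.00]))  # -> 0.775
```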
Why it matters: As researchers develop new large language models, APE provides a systematic way to get the most out of them.
We’re thinking: Prompt engineers have only existed for a few years, and already robots are coming for their jobs!