Memorize Less; Retrieve More: How small language models can perform specialized tasks


Large language models are pretrained only to predict the next word based on the previous ones. Yet they absorb enough information along the way that a modest fine-tuning set can teach them to perform tasks such as answering questions. New research shows how smaller models, too, can perform specialized tasks relatively well after fine-tuning on only a handful of examples.

What’s new: Atlas is a language model of modest size that fulfills prompts by referring to external documents. Gautier Izacard and Patrick Lewis led the project with colleagues at Meta, École Normale Supérieure, Paris Sciences et Lettres, Inria, and University College London.

Key insight: A large language model uses its huge complement of parameters to memorize information contained in its pretraining and fine-tuning datasets. It wouldn’t need to memorize so much — and thus wouldn’t need so many parameters — if it had access to documents on demand.

How it works: Atlas comprises a retriever that’s pretrained to fetch relevant documents from Wikipedia and Common Crawl, and a language model that uses the retrieved documents to respond to prompts. The authors fine-tuned the system on tasks including answering open-ended questions in KILT and multiple-choice questions in MMLU.

  • The retriever includes two transformers. One learned to produce an embedding of a prompt (when fine-tuning for, say, answering questions, it learned to produce an embedding of a question). The other learned to produce an embedding of each document; these document embeddings were precomputed and stored (see the retrieval sketch after this list).
  • The language model, an encoder-decoder that produces its own embeddings of the retrieved documents, was pretrained to fill in missing words in text from Wikipedia and Common Crawl (see the span-filling example after this list).
  • The authors further trained the retriever and language model jointly on a similar task, but with different loss functions. The language model, given new text with missing words along with its own embeddings of retrieved documents, learned to fill in the missing words. The retriever, given the text with missing words, learned to identify documents that contain similar text. Its loss function encouraged it to rate documents as more similar to the prompt if the language model was more confident in the text it generated using those documents (see the loss sketch after this list).
  • Given a prompt, the retriever compared it to its stored document embeddings and selected the 20 most relevant documents. Then, given the prompt and the retrieved documents, the language model generated the output.
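
The retrieval step amounts to a nearest-neighbor search over the stored document embeddings. Here is a minimal sketch in PyTorch; the encoder modules, dimensions, and document counts are illustrative placeholders rather than details from the paper:

```python
import torch

# Hypothetical stand-ins for the retriever's two transformers.
# Atlas uses trained text encoders; random linear layers keep the sketch runnable.
embed_dim = 768
query_encoder = torch.nn.Linear(1024, embed_dim)  # maps a prompt representation to an embedding
doc_encoder = torch.nn.Linear(1024, embed_dim)    # maps a document representation to an embedding

# Precompute and store document embeddings once (e.g., for Wikipedia and Common Crawl).
num_docs = 10_000
doc_features = torch.randn(num_docs, 1024)        # placeholder document inputs
with torch.no_grad():
    doc_index = doc_encoder(doc_features)         # (num_docs, embed_dim), kept around between queries

def retrieve(prompt_features: torch.Tensor, k: int = 20) -> torch.Tensor:
    """Return the indices of the k documents whose embeddings best match the prompt."""
    with torch.no_grad():
        q = query_encoder(prompt_features)        # (embed_dim,)
        scores = doc_index @ q                    # dot-product similarity to every stored document
        return scores.topk(k).indices             # the 20 most relevant documents by default

top_doc_ids = retrieve(torch.randn(1024))
```

In practice, the document embeddings come from a trained encoder and may live in an approximate nearest-neighbor index rather than a dense matrix, but the dot-product-then-top-k logic is the same.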
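The fill-in-the-missing-words pretraining objective pairs text with spans masked out against a target that supplies the missing spans. This toy pair is purely illustrative (the paper’s exact masking scheme may differ):

```python
# Illustrative fill-in-the-blanks training pair (T5-style span corruption).
# The model reads the input, where sentinel tokens mark missing spans,
# and learns to generate those spans in order.
input_text  = "The Eiffel Tower was completed in <extra_id_0> and stands in <extra_id_1>."
target_text = "<extra_id_0> 1889 <extra_id_1> Paris"
```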
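The coupling between the two components shows up in the retriever’s loss: documents that make the language model more confident about the correct output should receive higher retrieval scores. Below is a rough sketch of one way to express that, as a KL divergence between the retriever’s distribution over retrieved documents and a target distribution derived from the language model’s per-document log-likelihoods; the paper studies several loss variants, and the temperature and function names here are assumptions:

```python
import torch
import torch.nn.functional as F

def retriever_loss(similarity_scores: torch.Tensor,
                   lm_log_likelihoods: torch.Tensor,
                   temperature: float = 1.0) -> torch.Tensor:
    """similarity_scores: the retriever's prompt-document scores for the k retrieved documents.
    lm_log_likelihoods: log-probability the language model assigned to the correct output
    when conditioned on each of those documents."""
    # The retriever's distribution over the retrieved documents.
    log_p_retriever = F.log_softmax(similarity_scores, dim=-1)
    # Target distribution: documents that made the language model more confident get more mass.
    # Detached, so this loss updates only the retriever.
    q_lm = F.softmax(lm_log_likelihoods.detach() / temperature, dim=-1)
    # Pull the retriever's ranking toward the language model's preferences.
    return F.kl_div(log_p_retriever, q_lm, reduction="sum")

# Toy usage with k = 20 retrieved documents.
scores = torch.randn(20, requires_grad=True)   # would come from the retriever's dot products
log_likelihoods = torch.randn(20)              # would come from the language model
loss = retriever_loss(scores, log_likelihoods)
loss.backward()
```

Here the language model’s likelihoods are treated as a fixed target, so the gradient updates only the retriever; in the full system, the language model is trained at the same time with its own fill-in-the-blanks loss.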

Results: The authors evaluated Atlas in few-shot and fully fine-tuned settings on MMLU (multiple-choice questions with four possible answers, so random chance is 25 percent) and on the Natural Questions subset of KILT (open-ended questions, so accuracy measures the percentage of outputs that exactly matched ground truth).
  • Fine-tuned on five MMLU examples, Atlas (11 billion parameters) achieved 47.9 percent average accuracy, while GPT-3 (175 billion parameters) achieved 43.9 percent. (Atlas didn’t beat the 70-billion-parameter Chinchilla, which achieved 67.5 percent.)
  • Fine-tuned on all MMLU training examples, Atlas achieved 66 percent average accuracy, while GPT-3 achieved 53.9 percent.
  • Fine-tuned on 64 Natural Questions examples, Atlas achieved 42.4 percent accuracy, while the next-best model, PaLM (540 billion parameters), achieved 39.6 percent.
  • Fine-tuned on all Natural Questions training examples, Atlas achieved 60.4 percent accuracy, while the previous state of the art, R2-D2 (1.3 billion parameters), achieved 55.9 percent.

Why it matters: Training smaller models consumes less energy and costs less. Shifting knowledge out of the model’s parameters and into an external database not only reduces the number of parameters needed but also makes the model’s knowledge easier to update. Instead of retraining the model, you can extend the document database by feeding new documents to the retriever’s document encoder and storing the resulting embeddings.
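
Concretely, adding knowledge reduces to embedding the new documents and appending the results to the stored index, with no gradient updates. A minimal sketch, again with placeholder encoders and shapes:

```python
import torch

# Hypothetical document encoder and existing index (shapes are illustrative).
doc_encoder = torch.nn.Linear(1024, 768)
doc_index = torch.randn(10_000, 768)         # embeddings of documents already in the database

# Adding knowledge: embed the new documents and append them. No parameters change,
# so no retraining is needed; future retrievals can immediately find the new material.
new_doc_features = torch.randn(50, 1024)     # placeholder representations of 50 new documents
with torch.no_grad():
    new_embeddings = doc_encoder(new_doc_features)
doc_index = torch.cat([doc_index, new_embeddings], dim=0)
```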

We’re thinking: Augmenting a language model’s training with retrieved documents is a promising avenue of research. RETRO did something similar, but it wasn’t fine-tuned on particular tasks, much less on a handful of examples. Similarly, researchers at Meta built a chatbot that used documents found on the web to generate more realistic conversations.