Language Models, Extended: Language models grew more reliable and less biased in 2022

Researchers pushed the boundaries of language models to address persistent problems of trustworthiness, bias, and updatability.

What happened: While many AI labs aimed to make large language models more sophisticated by refining datasets and training methods — including methods that trained a transformer to translate 1,000 languages — others extended model architectures to search the web, consult external documents, and adjust to new information.

Driving the story: The capacity of language models to generate plausible text outstrips their ability to discern facts and resist spinning fantasies and expressing social biases. Researchers worked to make their output more trustworthy and less inflammatory.

  • In late 2021, DeepMind proposed RETRO, a model that retrieves passages from the MassiveText dataset and integrates them into its output.
  • AI21 Labs’ spring launch of Jurassic-X introduced a suite of modules — including a calculator and a system that queries Wikipedia — to fact-check a language model’s answers to math problems, historical facts, and the like.
  • Researchers at Stanford and École Polytechnique Fédérale de Lausanne created SERAC, a system that updates language models with new information without retraining them. A separate system stores new data and learns to answer queries that are relevant to that data.
  • Meta built Atlas, a language model that answers questions by retrieving information from a database of documents. Published in August, this approach enabled an 11-billion-parameter Atlas to outperform a 540-billion-parameter PaLM at answering questions.
  • Late in the year, OpenAI fine-tuned ChatGPT to minimize untruthful, biased, or harmful output. Humans ranked the quality of the model’s outputs, then a reinforcement learning algorithm rewarded the model for generating outputs similar to those ranked highly.
  • Such developments intensified the need for language benchmarks that evaluate more varied and subtle capabilities. Answering the call, more than 130 institutions collaborated on BIG-bench, which includes tasks like deducing a movie title from emojis, participating in mock trials, and detecting logical fallacies.
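
To make the SERAC item above concrete, here is a minimal sketch of the general idea: keep the base model frozen, store new facts in a separate memory, and route a query to that memory only when it appears relevant to a stored fact. This is a toy illustration, not SERAC itself. The names `EditMemoryWrapper`, `overlap_score`, and `frozen_base_model` are hypothetical, the word-overlap score stands in for a learned relevance classifier, and the stored answer is returned directly where a real system would use a second model to generate it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


def tokenize(text: str) -> set:
    """Lowercase, split on whitespace, and strip basic punctuation."""
    return {w.strip(".,?!").lower() for w in text.split() if w.strip(".,?!")}


def overlap_score(query: str, edit_key: str) -> float:
    """Jaccard word overlap: a toy stand-in for a learned relevance classifier."""
    q, k = tokenize(query), tokenize(edit_key)
    if not q or not k:
        return 0.0
    return len(q & k) / len(q | k)


@dataclass
class EditMemoryWrapper:
    """Wraps a frozen model with an external memory of (question, answer) edits."""
    base_model: Callable[[str], str]
    threshold: float = 0.5  # assumed cutoff; a real system would learn this decision
    edits: List[Tuple[str, str]] = field(default_factory=list)

    def add_edit(self, question: str, answer: str) -> None:
        # Storing an edit never touches the base model's weights.
        self.edits.append((question, answer))

    def answer(self, query: str) -> str:
        # Find the stored edit most relevant to the query, if any.
        best = max(self.edits, key=lambda e: overlap_score(query, e[0]), default=None)
        if best is not None and overlap_score(query, best[0]) >= self.threshold:
            return best[1]  # query is in scope of an edit: use the new information
        return self.base_model(query)  # otherwise fall back to the frozen model


def frozen_base_model(query: str) -> str:
    """Hypothetical placeholder for a pretrained model with outdated knowledge."""
    return "stale answer"


wrapper = EditMemoryWrapper(base_model=frozen_base_model)
wrapper.add_edit("Who is the CEO of ExampleCorp?", "Alex Doe")  # made-up fact

print(wrapper.answer("Who is the CEO of ExampleCorp?"))
print(wrapper.answer("What is the capital of France?"))
```

The first query matches the stored edit and is answered from memory, while the second falls through to the unchanged base model. Adding or removing an edit is a list operation rather than a training run.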

Behind the news: Amid the progress came a few notable stumbles. The public demo of Meta’s Galactica, a language model trained to generate text on scientific and technical subjects, lasted three days in November before its developers pulled the plug due to its propensity to generate falsehoods and cite nonexistent sources. In August, the chatbot BlenderBot 3, also from Meta, quickly gained a reputation for spouting racist stereotypes and conspiracy theories.

Where things stand: The toolbox of truth and decency in text generation grew substantially in the past year. Successful techniques will find their way into future waves of blockbuster models.