Language Models Defy Logic: Large NLP models struggle with logical reasoning

Who would disagree that, if all people are mortal and Socrates is a person, Socrates must be mortal? GPT-3, for one. Recent work shows that bigger language models are not necessarily better when it comes to logical reasoning.

What’s new: Researchers tested the ability of language models to determine whether a statement follows a set of premises. Simeng Han led the project with collaborators at Yale University, University of Illinois, Iowa City West High School, University of Washington, University of Hong Kong, Penn State University, Meta, and Salesforce.

Key insight: Previous efforts to test logical reasoning in language models were based on datasets that contained limited numbers of words (roughly between 100 and 1,000), premises (up to five per example), and logical structures (less than 50). A more diverse dataset would make a better test.

How it works: The authors assembled FOLIO, a dataset of over 1,400 examples of real-world logical reasoning that uses more than 4,350 words, up to eight premises, and 76 distinct logical structures. They challenged a variety of models to classify whether the relationship between a set of premises and an example conclusion was true, false, or unknown.

  • The authors asked human annotators to generate logical stories of premises and a conclusion. They verified the logic using an automated program.
  • They tested BERT and RoBERTa, two of the most popular language encoders, by appending two fully connected layers and fine-tuning the models on 70 percent of the dataset.
  • They tested Codex, GPT-3, GPT-NeoX-20B, and OPT in 13- and 66-billion parameter variations. They prompted the models with eight labeled examples. Then the model classified an unlabelled example.

Results: A fine-tuned RoBERTa-large (340 million parameters) accurately labeled 62.11 percent of FOLIO’s test examples, while a fine-tuned BERT-large of the same size achieved 59.03 percent accuracy. The probability of predicting the correct answer at random was 33.33 percent. Given eight labeled logic stories as input, Codex (of unknown size) achieved 56.04 percent accuracy, while GPT-3 (175 billion parameters) achieved 43.44 percent.

Why it matters: Language models can solve simple logic puzzles, but their performance is inconsistent and depends a great deal on the prompt they’re given. This work offers a more rigorous benchmark for tracking progress in the field.

We’re thinking: The recently unveiled ChatGPT has wowed many users, but its ability to solve logic problems varies wildly with the prompt. It’s not clear whether some of the outputs shared on social media represented its best — or most embarrassing — results. A systematic study like this would be welcome and important.