LLMs & creativity: measuring divergent association

When looking at creative tasks, how do large language models compare to human performance?

There has been some recent machine learning research on the subject of creativity. Following these leads took me into a fascinating area of psychology: how “divergent” thinking can be used to study the creative process. Is there a way to quantify creativity? What methods are used to do this? Can large language models (LLMs) be considered creative?

I decided to have a look at one such quantitative task, outlined in the paper “Naming unrelated words predicts creativity” by Olson et al. The basic concept behind the study is that creative people are able to generate more divergent ideas; the authors suggest that naming unrelated words and then measuring the semantic distance between them could serve as an objective measure of divergent thinking.

The study introduced a new measure of divergent thinking called the Divergent Association Task (DAT). The task asks participants to generate 10 nouns that are as different from each other as possible, in all meanings and uses of the words. The responses are then scored by computing the average pairwise semantic distance between the words.
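To make that concrete, here’s a minimal sketch of how a DAT-style score can be computed, assuming GloVe vectors loaded through gensim. The original study uses a larger GloVe model and validates the submitted words first, so treat this as an approximation rather than the authors’ exact pipeline:

```python
# Sketch: scoring a Divergent Association Task (DAT) response as the
# mean pairwise cosine distance between word embeddings, scaled by 100.
from itertools import combinations

import gensim.downloader
from scipy.spatial.distance import cosine

# Small GloVe model for illustration; the paper uses a larger one.
model = gensim.downloader.load("glove-wiki-gigaword-100")

def dat_score(words: list[str]) -> float:
    """Mean pairwise cosine distance between word vectors, scaled to 0-200."""
    vectors = [model[w.lower()] for w in words]  # raises KeyError if a word is out of vocabulary
    distances = [cosine(a, b) for a, b in combinations(vectors, 2)]
    return 100 * sum(distances) / len(distances)

# Example: ten nouns a participant might give.
print(dat_score(["cat", "galaxy", "democracy", "pancake", "violin",
                 "glacier", "algorithm", "perfume", "anchor", "carnival"]))
```

A more divergent word list pushes the average distance up, so higher scores stand in for more divergent thinking.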

This led to the fairly intuitive idea of using LLMs to attempt the same task; the results are given in the PDF in the repo, where the code is also available. It also suggests further, related questions: can we apply LLMs to other creative tasks, such as the Alternate Uses Test? How do LLMs compare to human performance on these and other related tasks?
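As a rough illustration of the setup, here’s a sketch that asks an LLM to attempt the DAT and scores its answer with `dat_score()` from the sketch above. The model name and prompt wording here are illustrative assumptions, not necessarily the ones used in my experiments:

```python
# Sketch: prompting an LLM to attempt the DAT, then scoring its answer.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Name 10 English nouns that are as different from each other as "
    "possible, in all meanings and uses of the words. Reply with only "
    "the 10 words, one per line."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # hypothetical model choice
    messages=[{"role": "user", "content": PROMPT}],
)
words = [w.strip() for w in response.choices[0].message.content.splitlines() if w.strip()]
print(words, dat_score(words))
```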

Here’s a little more research on creativity and large language models, following on from the work above. I’ve been working on a number of research projects recently, and one area I focused on was LLMs and creativity.

The Alternate Uses Test (AUT) is a test of divergent creativity. A participant is given a number of everyday objects, such as a bowl or a pencil, and is asked to generate as many different uses for each as possible. Their answers are marked on creativity, originality, and fluency.
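Human raters traditionally do this marking, but one common automated proxy is to score each proposed use by its semantic distance from the object itself (with fluency being simply the number of responses). Here’s a minimal sketch of that proxy, assuming the sentence-transformers library and one of its stock models; it is not the manual scoring rubric:

```python
# Sketch: an automated proxy for AUT originality, scoring each proposed
# use by its cosine distance from the object name.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def aut_originality(obj: str, uses: list[str]) -> list[float]:
    """Cosine distance (1 - similarity) between the object and each use."""
    obj_vec = model.encode(obj)
    use_vecs = model.encode(uses)
    sims = util.cos_sim(obj_vec, use_vecs)[0]
    return [1.0 - float(s) for s in sims]

uses = ["hold soup", "use as a drum", "make a sundial", "wear as a helmet"]
for use, score in zip(uses, aut_originality("bowl", uses)):
    print(f"{use}: {score:.2f}")
```

More unexpected uses tend to sit further from the object in embedding space, so they receive higher scores under this proxy.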

This raises some interesting questions: can LLMs score well on the AUT? How well? Could they even beat human performance, even when running on a consumer laptop GPU?

Here’s the paper describing what I found out.

The_Alternate_Uses_Test__human_and_language_model_divergent_creativity.PDF (206.5 KB)