Stable Biases: Stable Diffusion may amplify biases in its training data

Stable Diffusion may amplify biases in its training data in ways that promote deeply ingrained social stereotypes.

What’s new: The popular text-to-image generator from Stability AI tends to underrepresent women in images of prestigious occupations and overrepresent darker-skinned people in images of low-wage workers and criminals, Bloomberg reported.

How it works: Stable Diffusion was pretrained on five billion text-image pairs scraped from the web. The reporters prompted the model to generate 300 face images each of workers in 14 professions, seven of them stereotypically “high-paying” (such as lawyer, doctor, and engineer) and seven considered “low-paying” (such as janitor, fast-food worker, and teacher). They also generated images for three negative keywords: “inmate,” “drug dealer,” and “terrorist.” They analyzed the skin color and gender of the resulting images.

  • The reporters averaged the color of pixels that represent skin in each image. They grouped the average color in six categories according to a scale used by dermatologists. Three categories represented lighter-skinned people, while the other three represented darker-skinned people.
  • To analyze gender, they manually classified the perceived gender of each image’s subject as “man,” “woman,” or “ambiguous.”
  • They compared the results to United States Bureau of Labor Statistics data that details each profession’s racial composition and gender balance.
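
The skin-tone step described above can be sketched roughly as follows. This is a minimal illustration, not Bloomberg's code: the skin mask is assumed to come from some upstream segmentation step, and the six equal-width lightness bins are a hypothetical stand-in for the dermatological scale, whose actual cutoffs the report does not give here. All function names are made up for the example.

```python
from statistics import mean

# ASSUMPTION: six equal-width lightness bins over 0-255. These cutoffs are
# illustrative only and are NOT the dermatological scale used by the reporters.
NUM_CATEGORIES = 6


def average_skin_color(pixels, skin_mask):
    """Average the RGB values of the pixels flagged as skin.

    pixels:    list of (r, g, b) tuples
    skin_mask: list of booleans, True where the pixel shows skin
               (assumed to be produced by a separate segmentation step)
    """
    skin = [p for p, is_skin in zip(pixels, skin_mask) if is_skin]
    if not skin:
        raise ValueError("no skin pixels in mask")
    # zip(*skin) regroups the pixels into three per-channel sequences.
    return tuple(mean(channel) for channel in zip(*skin))


def lightness(rgb):
    """Approximate perceived lightness with the Rec. 601 luma weights."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def skin_category(rgb):
    """Map an average color to a category from 1 (lightest) to 6 (darkest)."""
    bin_width = 256 / NUM_CATEGORIES
    index_from_dark = int(lightness(rgb) // bin_width)  # 0 = darkest bin
    return NUM_CATEGORIES - index_from_dark


def is_darker_group(category):
    """Categories 4-6 form the darker-skinned group, 1-3 the lighter one."""
    return category >= 4


# Toy image: two skin pixels and one background pixel.
pixels = [(200, 160, 140), (220, 180, 160), (0, 0, 0)]
mask = [True, True, False]
avg = average_skin_color(pixels, mask)
category = skin_category(avg)
```

Averaging over the masked pixels only keeps background and clothing from skewing the result. A real pipeline would also need to handle lighting and shadows, which this sketch ignores.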

Results: Stable Diffusion’s output aligned with social stereotypes but not with real-world data.

  • The model generated a higher proportion of women than the U.S. national percentage in four occupations, all of them “low-paying” (cashier, dishwasher, housekeeper, and social worker).
  • In high-paying occupations, by contrast, women were scarce. Stable Diffusion portrayed women as “doctors” in 7 percent of images and as “judges” in 3 percent. In fact, women represent 39 percent of U.S. doctors and 34 percent of U.S. judges. Only one generated image of an “engineer” depicted a woman, while women represent 14 percent of U.S. engineers. (Of course, the U.S. percentages likely don’t match those in other countries or the world as a whole.)
  • More than 80 percent of Stable Diffusion’s images of inmates and more than half of its images of drug dealers matched the three darkest skin tone categories. Images of “terrorists” frequently showed stereotypically Muslim features including beards and head coverings.
  • The authors point out that skin color does not equate to race or ethnicity, so comparisons between color and real-world demographic data are not valid.
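
To make the size of the gender gap concrete, the figures quoted above can be compared directly. The sketch below uses only numbers from this article. It assumes that all 300 “engineer” images count toward the denominator, and the function and variable names are hypothetical.

```python
# Share of generated images depicting women, in percent (from the article).
# ASSUMPTION: one woman among all 300 generated "engineer" images.
generated_pct = {
    "doctor": 7.0,
    "judge": 3.0,
    "engineer": 100 * 1 / 300,
}

# Share of women in the U.S. workforce for the same occupations, in percent.
actual_pct = {
    "doctor": 39.0,
    "judge": 34.0,
    "engineer": 14.0,
}


def representation_gap(generated, actual):
    """Percentage-point difference; negative means underrepresentation."""
    return generated - actual


def representation_ratio(generated, actual):
    """Generated share as a fraction of the real-world share."""
    return generated / actual


gaps = {
    job: representation_gap(generated_pct[job], actual_pct[job])
    for job in generated_pct
}
ratios = {
    job: representation_ratio(generated_pct[job], actual_pct[job])
    for job in generated_pct
}

for job in generated_pct:
    print(f"{job}: gap {gaps[job]:+.1f} points, ratio {ratios[job]:.2f}")
```

By this measure, women appear in images of doctors at less than one-fifth of their real-world rate, and in images of judges at less than one-tenth.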

Behind the news: Image generators have been found to reproduce and often amplify biases in their training data.

  • In March 2023, researchers at Leipzig University and Hugging Face found that both DALL·E 2 and Stable Diffusion tended to overrepresent men relative to the U.S. workforce. (The previous July, OpenAI had reported that it was addressing issues of this sort.)
  • Pulse, a model designed to sharpen blurry images, caused controversy in 2020 when it transformed a pixelated headshot of former U.S. president Barack Obama, who is Black, into the face of a white man. More recently, users of the Lensa photo editor app, which is powered by Stable Diffusion, reported that it sexualized images of women.
  • In 2020, after studies showed that ImageNet contained many images with sexist, racist, or hateful labels, the team that manages the dataset updated it to eliminate hateful tags and include more diverse images. Later that year, the team behind the dataset TinyImages withdrew it amid reports that it was rife with similar issues.

Why it matters: Not long ago, the fact that image generators reflect and possibly amplify biases in their training data was mostly academic. Now, because a variety of software products integrate them, such biases can leach into products as diverse as video games, marketing copy, and law-enforcement profiles.

We’re thinking: While it’s important to minimize bias in our datasets and trained models, it’s equally important to use our models in ways that support fairness and justice. For instance, a judge who weighs individual factors in decisions about how to punish a wrongdoer may be better qualified to decide than a model that simply reflects demographic trends in criminal justice.


Interesting post. Thank you.

Looks like the Fitzpatrick skin tone scale was used for this. For those who are interested in the topic, the Wikipedia link below describes the scale. The scale itself has been criticized for having more representation at the lower end than at the upper end. So one wonders whether the correction should start with this scale. If so, what could take its place? I read that Fenty Beauty by Rihanna got acclaim for launching 40 shades of foundation. Perhaps using those shades would be more nuanced.

Fitzpatrick scale - Wikipedia

Katherine Moss