Crawl the Web, Absorb the Bias

Hey Everyone!

Check out the emerging generation of trillion-parameter models which need datasets of billions of examples, but the most readily available source of examples on that scale — the web — is polluted with bias and antisocial expressions. A new study examines the issue.

What’s new: Abeba Birhane and colleagues at University College Dublin and University of Edinburgh audited the LAION-400M dataset, which was released in September. It comprises data scraped from the open web, from which inaccurate entries were removed by a state-of-the-art model for matching images to text. The automated curation left plenty of worrisome examples among the remaining 400 million examples — including stereotypes, racial slurs, and sexual violence — raising concerns that models trained on LAION-400M would inherit its shortcomings.

Why it matters: It’s enormously expensive to manually clean a dataset that spans hundreds of millions of examples. Automated curation has been viewed as a way to ensure that immense datasets contain high-quality data. This study reveals serious flaws in that approach.

We’re thinking: Researchers have retracted or amended several widely used datasets to address issues of biased and harmful data. Yet, as the demand for data rises, there’s no ready solution to this problem. Audits like this make an important contribution, and the community — including large corporations that produce proprietary systems — would do well to take them seriously.

To read full story, click here

1 Like