Position: Societal Impacts Research Requires Benchmarks for Creative Composition Tasks

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 Position Paper Track poster · CC BY 4.0
TL;DR: Everyday creativity tasks delegated to AI will impact society; we need better ways to measure them.
Abstract: Foundation models capable of automating cognitive tasks represent a pivotal technological shift, yet their societal implications remain unclear. These systems promise exciting advances, but they also risk flooding our information ecosystem with formulaic, homogeneous, and potentially misleading synthetic content. Developing benchmarks grounded in the real use cases where these risks are most significant is therefore critical. Through a thematic analysis of 2 million language model user prompts, we identify *creative composition tasks* as a prevalent usage category in which users seek help with personal tasks that require everyday creativity. Our fine-grained analysis reveals mismatches between current benchmarks and usage patterns among these tasks. Crucially, we argue that the same use cases that currently lack thorough evaluations can lead to negative downstream impacts. This position paper argues that benchmarks focused on creative composition tasks are a necessary step toward understanding the societal harms of AI-generated content. We call for greater transparency in usage patterns to inform the development of new benchmarks that can effectively measure both the progress and the impacts of models with creative capabilities.
Lay Summary: While AI systems are becoming widespread, we lack proper ways to evaluate their potential societal harms, particularly the risk of flooding our information ecosystem with generic or misleading content. We analyzed 2 million real user requests to understand actual usage patterns and discovered that open-ended tasks requiring everyday creativity, like writing emails or social media posts, are among the most common uses. We introduce the term "creative composition tasks" to describe this broad category. We then found significant mismatches between these real-world applications and existing AI benchmarks. This reveals a critical blind spot: the creative tasks people use AI for most are exactly those that could cause significant societal harm through homogeneous content, bad advice, and unintended communication between people, yet they are not well evaluated. Our research demonstrates the urgent need for new benchmarks focused on creative composition to properly measure both AI progress and its potential negative impacts. Without addressing this gap, we cannot understand or mitigate the risks as these powerful systems become more prevalent in society.
Primary Area: Social, Ethical, and Environmental Impacts
Keywords: Societal Impacts, Evaluation, Creativity
Submission Number: 125