Position: When Incentives Backfire, Data Stops Being Human

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 Position Paper Track poster, CC BY 4.0
Abstract: Progress in AI has relied on human-generated data, from annotator marketplaces to the wider Internet. However, the widespread use of large language models now threatens the quality and integrity of human-generated data on these very platforms. We argue that this issue goes beyond the immediate challenge of filtering AI-generated content -- it reveals deeper flaws in how data collection systems are designed. Existing systems often prioritize speed, scale, and efficiency at the cost of intrinsic human motivation, leading to declining engagement and data quality. We propose that rethinking data collection systems to align with contributors' intrinsic motivations -- rather than relying solely on external incentives -- can help sustain high-quality data sourcing at scale while maintaining contributor trust and long-term participation.
Lay Summary: Discussions around data quality in machine learning often focus on technical indicators and definitions, overlooking the human sources that generate this data. Much of today’s data comes from user participation on online platforms. This led us to ask: can we learn something about sustaining data quality by examining how humans participate on these platforms? We examine the quantity-quality tradeoff in data generation through the lens of human motivation. Drawing from social science, we show how excessive reliance on external incentives can undermine intrinsic motivation. We propose a shift: design engaging, suitably incentivized environments (e.g., online games) that encourage meaningful participation while producing high-quality data. Our paper highlights the motivational forces behind online data generation for AI/ML and illustrates cases of past systems that have successfully navigated the quantity-quality tradeoff to generate meaningful human data. We also emphasize key design considerations for building trustworthy data collection environments of the future that will not only generate high-quality data, but also respect and support the people contributing it.
Primary Area: Data Set Creation, Curation, and Documentation
Keywords: data collection, human data economics
Submission Number: 207