Position: When Incentives Backfire, Data Stops Being Human

Published: 01 May 2025, Last Modified: 23 Jul 2025 · ICML 2025 Position Paper Track (poster) · CC BY 4.0
Abstract: Progress in AI has relied on human-generated data, from annotator marketplaces to the wider Internet. However, the widespread use of large language models now threatens the quality and integrity of human-generated data on these very platforms. We argue that this issue goes beyond the immediate challenge of filtering AI-generated content -- it reveals deeper flaws in how data collection systems are designed. Existing systems often prioritize speed, scale, and efficiency at the cost of intrinsic human motivation, leading to declining engagement and data quality. We propose that rethinking data collection systems to align with contributors' intrinsic motivations -- rather than relying solely on external incentives -- can help sustain high-quality data sourcing at scale while maintaining contributor trust and long-term participation.
Lay Summary: Discussions around data quality in machine learning often focus on technical indicators and definitions, overlooking the human sources that generate this data. Much of today’s data comes from user participation on online platforms. This led us to ask: can we learn something about sustaining data quality by examining how humans participate on these platforms? We examine the quantity-quality tradeoff in data generation through the lens of human motivation. Drawing on social science, we show how excessive reliance on external incentives can undermine intrinsic motivation. We propose a shift: design engaging, suitably incentivized environments (e.g., online games) that encourage meaningful participation while producing high-quality data. Our paper highlights the motivational forces behind online data generation for AI/ML and illustrates cases of past systems that have successfully navigated the quantity-quality tradeoff to generate meaningful human data. We also emphasize key design considerations for building trustworthy data collection environments of the future that will not only generate high-quality data, but also respect and support the people contributing it.
Verify Author Names: My co-authors have confirmed that their names are spelled correctly both on OpenReview and in the camera-ready PDF. (If needed, please update ‘Preferred Name’ in OpenReview to match the PDF.)
No Additional Revisions: I understand that after the May 29 deadline, the camera-ready submission cannot be revised before the conference. I have verified with all authors that they approve of this version.
Pdf Appendices: My camera-ready PDF file contains both the main text (not exceeding the page limits) and all appendices that I wish to include. I understand that any other supplementary material (e.g., separate files previously uploaded to OpenReview) will not be visible in the PMLR proceedings.
Latest Style File: I have compiled the camera ready paper with the latest ICML2025 style files <https://media.icml.cc/Conferences/ICML2025/Styles/icml2025.zip> and the compiled PDF includes an unnumbered Impact Statement section.
Paper Verification Code: MmRjO
Permissions Form: pdf
Primary Area: Data Set Creation, Curation, and Documentation
Keywords: data collection, human data economics
Submission Number: 207