Keywords: Data market; data valuation; data selection
TL;DR: The AI data value chain strips economic value from data generators and concentrates it with aggregators. We identify missing provenance, power imbalances, and non-dynamic pricing as causes and envision a framework for a fairer data exchange that benefits all participants.
Abstract: We argue that the machine learning value chain is structurally unsustainable due to an economic data processing inequality: each stage in the data cycle, from inputs to model weights to synthetic outputs, refines technical signal but strips economic equity from data generators. We show, by analyzing 73 public data deals, that the majority of value accrues to aggregators, with documented creator royalties rounding to zero and widespread opacity of deal terms. This is not just an economic welfare concern: as data and its derivatives become economic assets, the feedback loop that sustains current learning algorithms is at risk. We identify three structural faults (missing provenance, asymmetric bargaining power, and non-dynamic pricing) as the operational machinery of this inequality. In our analysis, we trace these problems along the machine learning value chain and propose an Equitable Data-Value Exchange (EDVEX) Framework to enable a minimal market that benefits all participants. Finally, we outline research directions where our community can make concrete contributions to data deals and contextualize our position with related and orthogonal viewpoints.
Lay Summary: AI systems learn from vast amounts of data, but today much of the money and control flows to large platforms that collect and resell that data, not to the people and organizations who generate it. We examined 73 publicly reported data deals and found that creators typically receive little or no ongoing compensation, and the details of most deals are opaque. We argue this situation is unstable for the broader AI ecosystem: if contributors are excluded, the supply of diverse, high‑quality data will shrink, and innovation will suffer.
We identify three root problems. First, missing provenance (traceability): once data is pooled, it is hard to see where it came from or to route rewards back to its source. Second, unequal bargaining power: fragmented contributors face take‑it‑or‑leave‑it terms. Third, static pricing: one‑time, flat fees ignore how useful a specific dataset is for a given task.
We outline EDVEX, a framework to make data exchanges fairer and more efficient. It matches tasks to the right data, tracks how data is used, and pays contributors in proportion to the measurable benefit their data provides. This can lower legal risk, improve data discovery, and give all players a way to participate. Our paper outlines the core ideas and research directions needed to make such a system practical.
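As a rough illustration of paying contributors "in proportion to the measurable benefit their data provides", the sketch below splits a fixed payment pool by each contributor's share of a benefit score. This is not the EDVEX mechanism itself. The function name, the contributor names, the pool size, and the use of a single numeric benefit score (for example, a measured accuracy gain on a task) are all assumptions made for illustration.

```python
def proportional_payouts(benefit, pool):
    """Split `pool` among contributors in proportion to their benefit scores.

    `benefit` maps a contributor name to a measured benefit (hypothetical
    units, e.g. accuracy gain). Negative scores are clipped to zero so a
    harmful dataset earns nothing rather than owing money.
    """
    clipped = {name: max(score, 0.0) for name, score in benefit.items()}
    total = sum(clipped.values())
    if total == 0:
        # No measurable benefit from anyone: nothing is paid out.
        return {name: 0.0 for name in clipped}
    return {name: pool * score / total for name, score in clipped.items()}


# Hypothetical contributors and benefit scores, for illustration only.
payouts = proportional_payouts(
    {"archive": 0.06, "forum": 0.03, "blog": 0.01},
    pool=1000.0,
)
for name, amount in payouts.items():
    print(f"{name}: {amount:.2f}")
```

Under these assumed scores, the contributor with twice the measured benefit receives twice the payout. A practical system would also need a defensible way to measure benefit, which this sketch takes as given.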
Submission Number: 595