Keywords: Compensation for data contributors, Data market, RAG market, Data Shapley
Abstract: As large language models increasingly rely on external data sources, compensating data contributors has become a central concern. But how should these payments be devised? We revisit data valuations from a _market-design perspective_ where payments serve to compensate data owners for the _private_ heterogeneous costs they incur for collecting and sharing data.
We show that popular valuation methods, such as Leave-One-Out and Data Shapley, make for poor payments: they fail to ensure truthful reporting of costs, leading to _inefficient market_ outcomes. To address this, we adapt well-established payment rules from mechanism design, namely Myerson and Vickrey-Clarke-Groves (VCG), to the data market setting. We show that the Myerson payment is the minimal truthful mechanism, optimal from the buyer’s perspective. Additionally, we identify a condition under which both data buyers and sellers are utility-satisfied and the market achieves efficiency. Our findings highlight the importance of incorporating incentive compatibility into data valuation design, paving the way for more robust and efficient data markets. Our data market framework is readily applicable to real-world scenarios. We illustrate this with simulations of contributor compensation in an LLM-based retrieval-augmented generation (RAG) marketplace tasked with challenging medical question answering.
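To make the idea of a truthful payment concrete, the following is a minimal sketch of a VCG-style threshold payment in a toy procurement setting. It is not the paper's mechanism or experimental setup: the value function `value`, the seller names, the costs, and the brute-force search over subsets are all assumptions chosen for illustration. The buyer selects the set of sellers that maximizes value minus reported costs, and each selected seller is paid their reported cost plus the welfare they add, which in this toy equals the highest cost they could have reported while still being selected.

```python
from itertools import combinations


def best_welfare(value, costs, sellers):
    """Brute-force the seller set maximizing value(S) - sum of reported costs."""
    best_set = frozenset()
    best_w = value(best_set)
    for r in range(1, len(sellers) + 1):
        for combo in combinations(sellers, r):
            s = frozenset(combo)
            w = value(s) - sum(costs[i] for i in s)
            if w > best_w:  # strict: ties keep the earlier (smaller) set
                best_set, best_w = s, w
    return best_set, best_w


def threshold_payments(value, costs):
    """Pay each selected seller: reported cost + marginal welfare contribution."""
    sellers = sorted(costs)
    chosen, w_all = best_welfare(value, costs, sellers)
    payments = {}
    for i in chosen:
        others = [j for j in sellers if j != i]
        _, w_without = best_welfare(value, costs, others)
        payments[i] = costs[i] + (w_all - w_without)
    return chosen, payments


# Hypothetical buyer value with diminishing returns in the number of sellers.
def value(s):
    return 10 * (1 - 0.5 ** len(s))


reported = {"a": 1, "b": 2, "c": 4}  # hypothetical reported costs
chosen, payments = threshold_payments(value, reported)
print(sorted(chosen), payments)

# Seller "a" overstates its cost but is still selected: the payment is unchanged.
_, payments_misreport = threshold_payments(value, {"a": 2, "b": 2, "c": 4})
print(payments_misreport["a"])
```

In this toy run, sellers `a` and `b` are selected and each is paid 2.5, and `a` receives the same 2.5 after overstating its cost as 2. A selected seller's payment does not depend on its own report, which is the property that removes the incentive to misreport.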
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 20610