Data Distribution Valuation with Incentive Compatibility

22 Sept 2023 (modified: 11 Feb 2024), Submitted to ICLR 2024
Primary Area: societal considerations including fairness, safety, privacy
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Data distribution valuation, Maximum mean discrepancy, Huber model
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose a valuation method for data distributions that satisfies incentive compatibility, and design theoretically principled and actionable policies for comparing the values of heterogeneous data distributions.
Abstract: Data valuation is a class of techniques for quantitatively assessing the value of data for applications like pricing in data marketplaces. Existing data valuation methods define a value for a dataset $D$. However, in many use cases, users are interested not only in the value of a dataset, but also in the value of the distribution from which the dataset was sampled. For example, consider a buyer deciding whether to purchase data from different vendors. Before purchasing, the buyer may observe (and compare) only a small sample from each vendor, and must use these samples to decide which vendor's data distribution is most useful. The core question of this work is: how should we compare the values of data distributions from their samples? Under a Huber model for statistical heterogeneity across vendors, we propose a maximum mean discrepancy (MMD)-based valuation method which enables theoretically principled and actionable policies for comparing data distributions from samples. We show theoretically that our method achieves incentive compatibility, thus incentivizing the data vendors to report their data truthfully. We demonstrate the efficacy of our proposed valuation method against several existing baselines, on multiple real-world datasets (e.g., network intrusion detection, credit card fraud detection) and downstream applications (classification, regression).
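To illustrate the idea of comparing vendors from samples, the following is a minimal sketch and not the paper's actual valuation function, which the abstract does not specify. It assumes a Gaussian (RBF) kernel with a fixed bandwidth, a buyer-held reference sample, and a toy Huber-style mixture in which a fraction `eps` of each vendor's points comes from an outlier distribution. The names `rbf_kernel`, `mmd_squared`, and `huber_sample` are hypothetical, and scoring a vendor by the negative squared MMD to the reference is an assumption made only for illustration.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    # Pairwise Gaussian kernel values between the rows of a and the rows of b.
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def mmd_squared(x, y, bandwidth=1.0):
    # Biased (V-statistic) estimate of the squared MMD between samples x and y.
    return (
        rbf_kernel(x, x, bandwidth).mean()
        + rbf_kernel(y, y, bandwidth).mean()
        - 2.0 * rbf_kernel(x, y, bandwidth).mean()
    )

def huber_sample(rng, n, eps):
    # Toy Huber-style mixture (assumption): each point is drawn from the
    # outlier distribution N(3, I) with probability eps, else from N(0, I).
    clean = rng.normal(0.0, 1.0, size=(n, 2))
    outlier = rng.normal(3.0, 1.0, size=(n, 2))
    is_outlier = rng.random(n) < eps
    return np.where(is_outlier[:, None], outlier, clean)

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(500, 2))  # hypothetical buyer-held sample
vendor_a = huber_sample(rng, 500, eps=0.05)      # lightly contaminated vendor
vendor_b = huber_sample(rng, 500, eps=0.40)      # heavily contaminated vendor

# Illustrative score: a smaller discrepancy to the reference means a higher value.
value_a = -mmd_squared(vendor_a, reference)
value_b = -mmd_squared(vendor_b, reference)
print("vendor A value:", value_a)
print("vendor B value:", value_b)
```

In this toy setting the lightly contaminated vendor receives the higher score, which is the kind of sample-based comparison the abstract describes. The kernel, bandwidth, and reference sample are all design choices that a real application would need to justify.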
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5560