Keywords: Data distribution valuation, Huber model distribution, Maximum mean discrepancy
TL;DR: We propose a maximum mean discrepancy (MMD)-based valuation metric for data distributions that follow the Huber model.
Abstract: Data valuation is a class of techniques for quantitatively assessing the value of data for applications such as pricing in data marketplaces. Existing data valuation methods define a value for a dataset D. However, in many use cases, users are interested not only in the value of a dataset but also in the value of the distribution from which the dataset was sampled. For example, consider a buyer deciding whether to purchase data from different vendors. Before purchasing, the buyer may observe (and compare) only a small sample from each vendor and must use it to decide which vendor’s data distribution is most useful. The core question is: how should we compare the values of data distributions using only their samples? Under a Huber model for statistical heterogeneity across vendors, we propose a maximum mean discrepancy (MMD)-based valuation method that enables theoretically principled and actionable policies for comparing data distributions from samples. We empirically demonstrate, against several existing baselines, that our method is sample-efficient and effective in identifying valuable data distributions on multiple real-world datasets (e.g., network intrusion detection, credit card fraud detection) and downstream applications (classification, regression).
Primary Subject Area: Data collection and benchmarking techniques
Paper Type: Research paper: up to 8 pages
Participation Mode: Virtual
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Submission Number: 21
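The sample-based comparison described in the abstract can be illustrated with a minimal sketch. It is not the paper's method: the abstract does not specify the valuation formula, so everything here is an assumption made for illustration. The sketch assumes a Huber-style mixture in which each vendor's distribution is (1 - eps) times a reference distribution plus eps times a vendor-specific noise distribution, an RBF kernel with a fixed bandwidth, the biased (V-statistic) estimate of squared MMD, and a hypothetical score equal to the negative squared MMD between a vendor's sample and a reference sample. The names `rbf_kernel`, `mmd_squared`, `sample_huber`, and the Gaussian choices are all hypothetical.

```python
import numpy as np


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of x and the rows of y."""
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy


def sample_huber(rng, n, eps, dim=2):
    """Draw n points from an assumed mixture (1 - eps) * N(0, I) + eps * N(4, I)."""
    clean = rng.normal(0.0, 1.0, size=(n, dim))
    noise = rng.normal(4.0, 1.0, size=(n, dim))
    contaminated = rng.random(n) < eps
    return np.where(contaminated[:, None], noise, clean)


rng = np.random.default_rng(0)

# Assumption: a clean sample from the reference distribution is available.
reference = sample_huber(rng, n=200, eps=0.0)

# Two hypothetical vendors with different contamination levels.
vendor_samples = {
    "vendor_a": sample_huber(rng, n=200, eps=0.05),
    "vendor_b": sample_huber(rng, n=200, eps=0.50),
}

# Hypothetical score: a sample closer to the reference (smaller MMD) scores higher.
values = {
    name: -mmd_squared(sample, reference)
    for name, sample in vendor_samples.items()
}

for name, value in sorted(values.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.4f}")
```

In this toy setup the less contaminated vendor should rank first, since under the assumed mixture the population squared MMD to the reference scales with eps squared. The fixed bandwidth is a simplification; in practice it is often chosen with a heuristic such as the median pairwise distance.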