When do Score-Based Data Valuation Methods Work, and Why?

Published: 02 Mar 2026, Last Modified: 02 Mar 2026, Venue: ICLR 2026 Workshop DATA-FM, Readers: Everyone, License: CC BY 4.0
Keywords: Shapley Scores, Leave-one-out, Data Valuation, Influence Function, Submodularity
TL;DR: Conditions under which Shapley and LOO scores succeed or fail at selecting the optimal data subset
Abstract: Score-based valuation methods, such as Shapley-style scores and Leave-one-out (LOO), are widely used for credit assignment in data markets, yet theory offers limited guidance on when and why these methods succeed. In this paper, we study these methods using the best subset selection problem. We show that, even with monotone-submodular valuation functions, selection using LOO and Shapley-style scores cannot achieve a constant-factor approximation due to duplicate archetypes and collapsed pointwise credit. We also find that boundary effects in canonical learning problems can lead to supermodular spikes, preventing any valuation method, including adaptive methods like greedy selection, from achieving a constant-factor approximation. We identify two conditions that avert these failure modes: (i) bounded curvature, which controls redundancy and restores guarantees for LOO and Shapley-style scores, and (ii) coverage, which yields approximate submodularity on top of a sufficiently rich core. Our theoretical results and experiments motivate a practical algorithmic pipeline: deduplicate, ensure coverage, then apply score-based selection at an appropriate granularity.
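To make the duplicate failure mode concrete, here is a minimal sketch on a toy coverage function (monotone and submodular). The data, the names `COVERS`, `value`, `loo_scores`, `shapley_scores`, `top_k`, and `deduplicate`, and the exact-match deduplication rule are all illustrative assumptions; this is not the paper's construction or its experimental setup.

```python
from itertools import permutations

# Hypothetical toy data: each point "covers" a set of features.
COVERS = {
    "a1": {1, 2, 3},
    "a2": {1, 2, 3},  # exact duplicate of a1
    "b": {4},
    "c": {5},
}

def value(subset):
    """Coverage valuation: number of distinct features covered by the subset."""
    covered = set()
    for name in subset:
        covered |= COVERS[name]
    return len(covered)

def loo_scores(points):
    """Leave-one-out: drop in value when a single point is removed from the full set."""
    full = value(points)
    return {p: full - value([q for q in points if q != p]) for p in points}

def shapley_scores(points):
    """Exact Shapley values by enumerating every ordering (only feasible for tiny sets)."""
    scores = {p: 0.0 for p in points}
    orderings = list(permutations(points))
    for ordering in orderings:
        prefix = []
        for p in ordering:
            scores[p] += value(prefix + [p]) - value(prefix)
            prefix.append(p)
    return {p: s / len(orderings) for p, s in scores.items()}

def top_k(scores, k):
    """Non-adaptive selection: keep the k highest-scoring points."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def deduplicate(points):
    """Assumed dedup rule: keep the first point for each distinct feature set."""
    seen, kept = set(), []
    for p in points:
        key = frozenset(COVERS[p])
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept

points = list(COVERS)
loo = loo_scores(points)
shap = shapley_scores(points)
loo_after_dedup = loo_scores(deduplicate(points))

print("LOO top-2:", top_k(loo, 2), "value", value(top_k(loo, 2)))
print("Shapley top-2:", top_k(shap, 2), "value", value(top_k(shap, 2)))
print("LOO top-2 after dedup:", top_k(loo_after_dedup, 2),
      "value", value(top_k(loo_after_dedup, 2)))
```

In this toy instance the best pair covers 4 features (for example `a1` with `b`). LOO assigns both duplicates a score of 0 and selects `b` and `c`, covering 2 features. Shapley splits the shared credit evenly (1.5 each) and selects both duplicates, covering 3. After removing the duplicate, the same LOO top-2 rule recovers a pair that covers 4.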
Submission Number: 54