Rethinking Shapley Value for Data Contribution

15 Sept 2025 (modified: 27 Nov 2025), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0
Keywords: data valuation, data contribution, Shapley values
Abstract: The Shapley value is a principled and widely used framework for data valuation in machine learning. However, its application has led to a critical, yet often overlooked, conceptual confusion between the value of a data point (its average utility across all subsets) and its specific, structural contribution (its role in shaping the final model). This conflation is problematic because valuation scores that are strongly influenced by small subsets may not reliably indicate the true contribution of a data point. To resolve this, we propose a framework designed to measure structural contribution directly. Our method modifies the Shapley formulation by (1) using a similarity-based utility function to capture impact on the global model structure, and (2) applying a Beta-weighting scheme to prioritize larger, more stable subsets. Experiments on SVMs show that our method identifies support vectors, which serve as the ground truth for contribution, more accurately than standard Shapley-based approaches in both precision and recall. The approach also performs strongly in data pruning tasks and is applicable to broader probabilistic models. Our work provides not just a new method but a clearer conceptual framework for distinguishing the valuation of a data point from its true contribution.
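The two modifications named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact utility or weighting formula, so everything below is an assumption made for illustration. The "model" is taken to be the mean vector of a subset, the utility is the cosine similarity between the subset model and the full-data model, and the Beta weighting is realized as a Beta-binomial weight over subset sizes with hypothetical parameters `a` and `b`. With `a = b = 1` the weights are uniform and the result is the ordinary Shapley value; with `a > b` more weight falls on larger subsets.

```python
import math
from itertools import combinations


def beta_weights(n, a=1.0, b=1.0):
    """Weight placed on each subset size k = 0..n-1 (Beta-binomial pmf).

    Assumption for illustration: a = b = 1 gives 1/n per size (Shapley);
    a > b shifts weight toward larger subsets.
    """
    m = n - 1

    def log_beta(x, y):
        return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)

    return [
        math.comb(m, k) * math.exp(log_beta(k + a, m - k + b) - log_beta(a, b))
        for k in range(n)
    ]


def mean_vector(points):
    """Toy 'model': the coordinate-wise mean of the given points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def cosine(u, v):
    """Cosine similarity, defined as 0 when either vector has zero norm."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)


def similarity_utility(subset, data):
    """Hypothetical similarity-based utility: how close the subset model
    is to the full-data model. The empty subset has utility 0."""
    if not subset:
        return 0.0
    full_model = mean_vector(data)
    sub_model = mean_vector([data[i] for i in subset])
    return cosine(sub_model, full_model)


def semivalues(data, a=1.0, b=1.0):
    """Exact Beta-weighted semivalue of every point (exponential in n,
    so only suitable for very small toy datasets)."""
    n = len(data)
    weights = beta_weights(n, a, b)
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            # Average marginal gain of point i over all subsets of size k.
            gains = [
                similarity_utility(s + (i,), data) - similarity_utility(s, data)
                for s in combinations(others, k)
            ]
            total += weights[k] * sum(gains) / len(gains)
        values.append(total)
    return values
```

Calling `semivalues(data)` on a short list of coordinate tuples returns one score per point, and `semivalues(data, a=4.0, b=1.0)` reweights the same marginal gains toward larger subsets. The exact enumeration is chosen only for clarity; on realistic datasets the sum over subsets would have to be approximated, for example by sampling.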
Primary Area: interpretability and explainable AI
Submission Number: 5311