Constrained-Data-Value-Maximization: Utilizing Data Attribution for Effective Data Pruning in Low-Data Environments

ICLR 2026 Conference Submission19058 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: Data Valuation, Data Attribution, Data Pruning, Constrained Optimization

TL;DR: We apply constrained optimization to a data attribution matrix to prune data.

Abstract: Attributing model behavior to training data is an evolving research field. A common benchmark is data removal: data points with either low or high values are removed, and a model trained on the remaining data is then evaluated. It is generally expected that removing low-value points results in a gradual decline in accuracy, while removing high-value points leads to a sharp decrease in performance. Many existing studies leverage Shapley-based data values for this task. In this paper, we demonstrate that these data values are not optimally suited for pruning low-value data when only a limited amount of data remains. To address this limitation, we introduce the Constrained-Data-Value-Maximization (CDVM) approach, which effectively utilizes data attributions for pruning in low-data scenarios. By casting pruning as a constrained optimization that both maximizes total influence and penalizes excessive per-test contributions, CDVM delivers robust performance even when only a small fraction of the data is retained. On the OpenDataVal benchmark, CDVM consistently outperforms existing alternatives, achieving state-of-the-art accuracy and competitive runtime.
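
The idea of maximizing total influence while discouraging excessive per-test contributions can be illustrated with a small sketch. This is not the paper's formulation, which the abstract does not spell out. It assumes a hypothetical attribution matrix `attributions` of shape (training points, test points), a hypothetical per-test ceiling `cap`, and a simple greedy heuristic standing in for the actual constrained optimization. The function name `select_capped_greedy` is likewise made up for illustration.

```python
import numpy as np


def select_capped_greedy(attributions, keep, cap):
    """Greedily keep `keep` training points under a per-test cap.

    attributions: (n_train, n_test) array; entry [i, j] is the assumed
        contribution of training point i to test point j.
    cap: assumed ceiling per test point; contribution accumulated
        beyond it earns no further credit.
    """
    n_train, n_test = attributions.shape
    selected = []
    remaining = np.ones(n_train, dtype=bool)
    totals = np.zeros(n_test)  # accumulated contribution per test point
    for _ in range(keep):
        # Capped objective value if each candidate were added next.
        gains = np.minimum(totals + attributions, cap).sum(axis=1)
        gains[~remaining] = -np.inf  # never pick a point twice
        best = int(np.argmax(gains))
        selected.append(best)
        remaining[best] = False
        totals += attributions[best]
    return selected


# Toy matrix: points 0 and 1 both serve test point 0; point 2 serves test point 1.
A = np.array([[5.0, 0.0],
              [4.0, 0.0],
              [0.0, 1.0]])

by_row_sum = [int(i) for i in np.argsort(-A.sum(axis=1))[:2]]
capped = select_capped_greedy(A, keep=2, cap=5.0)
```

In this toy case, ranking by summed value keeps points 0 and 1 and leaves test point 1 with no support, while the capped selection keeps points 0 and 2. That is the kind of low-data failure mode the abstract attributes to plain data values, shown here only under the assumptions stated above.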

Supplementary Material: zip

Primary Area: datasets and benchmarks

Submission Number: 19058
