Keywords: data selection, data valuation, data-centric AI, optimal transport, robust statistics
TL;DR: We propose JST, a novel framework to augment existing data valuation methods for high-quality data selection.
Abstract: Data valuation is crucial for assessing the impact and quality of individual data points, enabling data to be ranked by importance for efficient data collection, storage, and training. Many data valuation methods are sensitive to outliers and require a certain level of noise to effectively distinguish low-quality data from high-quality data, which makes them particularly useful for data removal tasks. In particular, optimal transport-based methods perform notably well in outlier detection but are only moderately effective at high-quality data selection, because they are sensitive to outliers and insensitive to small variations. To mitigate this insensitivity to high-quality data and facilitate effective data selection, we propose JST, a straightforward two-stage approach that first performs data valuation as usual and then performs a second round of data selection in which the low-quality data points identified in the first round are designated as the validation set and data valuation is run again. In this way, high-quality data become outliers with respect to the new validation set and can be naturally identified. We empirically evaluate an instantiation of our framework based on an optimal transport method for data selection and data pruning on several standard datasets; our framework demonstrates superior performance compared to pure data valuation, especially under small noise conditions. Additionally, we show the general applicability of our framework to influence function-based and reinforcement learning-based data valuation methods.
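The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the valuation function `proxy_values` is a hypothetical stand-in (negative mean Euclidean distance to the validation set) used in place of an optimal transport-based method, and the names `two_stage_select`, `n_low`, and `n_select` are assumptions introduced for illustration only.

```python
import numpy as np


def proxy_values(train, val):
    """Hypothetical stand-in for a data valuation method.

    Scores each training point by the negative mean Euclidean distance to
    the validation points, so points far from the validation set get low
    values. A real instantiation would use an actual valuation method.
    """
    dists = np.linalg.norm(train[:, None, :] - val[None, :, :], axis=-1)
    return -dists.mean(axis=1)


def two_stage_select(train, val, n_low, n_select, value_fn=proxy_values):
    """Sketch of a two-stage selection in the spirit of the abstract."""
    # Stage 1: value the training data against the original validation set
    # and flag the n_low lowest-valued points as low quality.
    stage1 = value_fn(train, val)
    low_idx = np.argsort(stage1)[:n_low]

    # Stage 2: reuse the flagged low-quality points as the validation set.
    # Points that now look most like outliers (lowest stage-2 value) are
    # treated as the high-quality candidates.
    stage2 = value_fn(train, train[low_idx])
    candidates = np.setdiff1d(np.arange(len(train)), low_idx)
    ranked = candidates[np.argsort(stage2[candidates])]
    return ranked[:n_select], low_idx


# Toy 1-D data: four clean points near 0 and two noisy points far away.
train = np.array([[0.0], [0.1], [-0.1], [0.2], [5.0], [6.0]])
val = np.array([[0.0], [0.05]])
selected, low_idx = two_stage_select(train, val, n_low=2, n_select=2)
print("flagged as low quality:", sorted(low_idx.tolist()))
print("selected as high quality:", selected.tolist())
```

In this toy setting the second stage ranks the remaining points by how dissimilar they are to the flagged points. The choice of `n_low` is an assumption of the sketch; the abstract does not state how the low-quality set is sized.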
Primary Area: other topics in machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10029