SAVA: Scalable Learning-Agnostic Data Valuation

Published: 22 Jan 2025 · Last Modified: 04 Mar 2025 · ICLR 2025 Poster · CC BY 4.0
Keywords: Data Valuation, Optimal Transport, Data Selection, Active Learning
TL;DR: We propose SAVA, a scalable optimal transport (OT)-based data valuation method for large datasets, which performs the OT computation on batches rather than on the whole dataset at once (as LAVA does).
Abstract: Selecting data for training machine learning models is crucial since large, web-scraped datasets contain noisy artifacts that affect the quality and relevance of individual data points and, in turn, model performance. We formulate this problem as a data valuation task, assigning a value to each data point in the training set according to how similar or dissimilar it is to a clean, curated validation set. Recently, *LAVA* (Just et al., 2023) demonstrated the use of optimal transport (OT) between a large noisy training dataset and a clean validation set to value training data efficiently, without depending on model performance. However, the *LAVA* algorithm requires the entire dataset as input, which limits its application to larger datasets. Inspired by the scalability of stochastic (gradient) approaches, which carry out computations on *batches* of data points instead of the entire dataset, we analogously propose *SAVA*, a scalable variant of *LAVA* that performs its computation on batches of data points. *SAVA* follows the same scheme as *LAVA*, leveraging a hierarchically defined OT problem for data valuation, but instead of processing the whole dataset it divides the dataset into batches and solves the OT problem on those batches. Moreover, our theoretical derivations on the trade-offs of using entropic regularization for OT problems refine prior work. We perform extensive experiments to demonstrate that *SAVA* scales to large datasets with millions of data points without trading off data valuation performance. Our GitHub repository is available at \url{https://github.com/skezle/sava}.
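For intuition, below is a minimal, hypothetical sketch of the batched idea using the POT library (`ot.dist`, `ot.sinkhorn`): it solves an entropic OT problem between each training batch and the clean validation set, then reads per-point values off the Sinkhorn dual potentials, in the spirit of LAVA's calibrated gradients. The function name, inputs, and hyperparameters are illustrative assumptions, not the authors' implementation; SAVA's actual hierarchical OT formulation and cost function are in the linked repository.

```python
import numpy as np
import ot  # Python Optimal Transport: pip install pot


def batched_ot_values(X_train, X_val, batch_size=1024, reg=0.1):
    """Sketch only: value each training point via entropic OT between its
    batch and the validation set, using LAVA-style calibrated dual potentials.
    This is not SAVA's hierarchical formulation."""
    n_val = X_val.shape[0]
    b = np.full(n_val, 1.0 / n_val)      # uniform mass on validation points
    values = np.empty(X_train.shape[0])

    for start in range(0, X_train.shape[0], batch_size):
        Xb = X_train[start:start + batch_size]
        m = Xb.shape[0]
        a = np.full(m, 1.0 / m)          # uniform mass on the batch
        M = ot.dist(Xb, X_val)           # squared-Euclidean cost matrix
        # Entropic OT; log=True exposes the Sinkhorn scaling vectors u, v.
        _, log = ot.sinkhorn(a, b, M, reg, log=True)
        f = reg * np.log(log['u'])       # dual potentials for batch points
        # Calibrated value: each point's potential vs. the mean of the rest;
        # larger values flag points more dissimilar to the validation set.
        values[start:start + m] = f - (f.sum() - f) / (m - 1)
    return values


# Toy usage on random features (e.g., embeddings from a pretrained encoder).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(4096, 32))
    X_val = rng.normal(size=(256, 32))
    v = batched_ot_values(X_train, X_val)
    print(np.argsort(v)[-10:])           # candidate noisy points to prune
```

Because each batch induces an independent OT problem, memory scales with the batch size rather than the full training set size, which is what lets this style of computation reach datasets with millions of points.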
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6869