KAIROS: Scalable Model-Agnostic Data Valuation

Jiongli Zhu; Parjanya Prajakta Prashant; Alex Cloninger; Babak Salimi

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0

Keywords: data valuation, mmd, model-agnostic, data-centric

TL;DR: A principled and scalable method for model-agnostic data valuation

Abstract: Data valuation techniques quantify each training example's contribution to model performance, providing a principled basis for data cleaning, acquisition, and selection. Existing valuation methods remain inadequate: model-based techniques depend on a single fitted model and inherit its biases, while algorithm-based approaches like Data Shapley scale poorly due to their need to train multiple models. Recent work has proposed model-agnostic alternatives based on Wasserstein distance between the training set and a clean reference set, but exact computation is expensive and approximations often misrank examples. We introduce KAIROS, a model-agnostic framework that values examples by their contribution to the Maximum Mean Discrepancy (MMD) between the training set and a clean reference distribution. Unlike Wasserstein methods, MMD admits a closed-form solution that requires no approximations and is scalable to large datasets. Additionally, KAIROS enables efficient online valuation: adding a new batch of $m$ examples requires only $O(mN)$ computation to update all scores, compared to $O(N^2)$ in prior work, where $N$ is the training set size. Empirical evaluations on noise, mislabeling, and poisoning benchmarks show that KAIROS consistently outperforms state-of-the-art baselines in both accuracy and runtime. On ImageNet, KAIROS achieves up to 15$\times$ speedup over the fastest baseline while maintaining superior data valuation quality. Our results demonstrate that model-agnostic methods can match or exceed model-based approaches in performance while scaling to large datasets.
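To make the abstract's idea concrete, recall that for a kernel $k$ the squared MMD between a training distribution $P$ and a reference distribution $Q$ is $\mathbb{E}_{x,x' \sim P}[k(x,x')] - 2\,\mathbb{E}_{x \sim P, y \sim Q}[k(x,y)] + \mathbb{E}_{y,y' \sim Q}[k(y,y')]$. The sketch below is an illustration only and is not taken from the paper: it assumes a Gaussian (RBF) kernel with a hand-picked `gamma`, and it scores each training example by the negated MMD witness function (mean kernel similarity to the reference set minus mean kernel similarity to the training set), which may differ from the exact KAIROS score. The class name `MMDValuer` and its methods are hypothetical. It also shows one way an online update could cost $O(mN)$ kernel evaluations, by caching per-example kernel sums.

```python
import math


def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two points given as sequences of floats."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


class MMDValuer:
    """Hypothetical helper: scores training points against a clean reference set.

    Assumption (not from the paper): score_i is the negated MMD witness
    function at x_i, i.e. mean_j k(x_i, y_j) - mean_j k(x_i, x_j).
    Lower scores mark points that sit where the training set has excess
    mass relative to the reference set.
    """

    def __init__(self, reference, gamma=1.0):
        self.reference = [list(y) for y in reference]
        self.gamma = gamma
        self.train = []
        self.train_sums = []  # sum over j of k(x_i, x_j), including j == i
        self.ref_means = []   # mean over j of k(x_i, y_j)

    def add_batch(self, batch):
        """Add m examples to a training set of size N.

        Each new point is compared once with every stored training point
        (O(m * N) kernel evaluations overall) and once with every
        reference point, so existing scores never need full recomputation.
        """
        for x in batch:
            x = list(x)
            self_sum = rbf(x, x, self.gamma)
            for i, xi in enumerate(self.train):
                k = rbf(x, xi, self.gamma)
                self.train_sums[i] += k  # update the existing point's sum
                self_sum += k            # accumulate the new point's sum
            ref_mean = sum(rbf(x, y, self.gamma) for y in self.reference)
            ref_mean /= len(self.reference)
            self.train.append(x)
            self.train_sums.append(self_sum)
            self.ref_means.append(ref_mean)

    def scores(self):
        """Return one score per training example, in insertion order."""
        n = len(self.train)
        return [r - s / n for r, s in zip(self.ref_means, self.train_sums)]


# Toy data: the reference clusters near 0; the last training point is an outlier.
reference = [[-0.1], [0.0], [0.1]]
valuer = MMDValuer(reference, gamma=1.0)
valuer.add_batch([[-0.05], [0.05]])  # first batch
valuer.add_batch([[0.0], [5.0]])     # later batch, scores updated incrementally
toy_scores = valuer.scores()
lowest = min(range(len(toy_scores)), key=toy_scores.__getitem__)
```

In this toy run the outlier at `5.0` receives the lowest score, so `lowest` is its index. A practical version would vectorize the kernel computations (for example with NumPy or a GPU library) rather than loop in pure Python.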

Primary Area: General machine learning (supervised, unsupervised, online, active, etc.)

Submission Number: 18852
