Keywords: Data valuation, Shapley values, Nearest Neighbors
TL;DR: We propose a new method to compute exact Shapley values for weighted k-Nearest Neighbors in near-linear time.
Abstract: Data valuation quantifies the impact of individual data points on model performance, and 
Shapley values provide a principled approach to this task due to their desirable axiomatic properties, 
albeit at high computational cost. 
Recent breakthroughs have enabled fast computation of exact Shapley values for unweighted $k$-nearest neighbor ($k$NN) classifiers. 
However, extending this to weighted $k$NN models has remained a significant open challenge. 
State-of-the-art methods either require quadratic time or resort to sampling-based approximation.
In this paper, we show that a conceptually simple but overlooked approach --- data duplication --- can be applied to this problem, yielding a natural variant of weighted $k$NN-Shapley.
However, a straightforward application of data duplication 
inflates the dataset, incurring prohibitive computational and memory costs.
We develop an efficient algorithm that avoids materializing the duplicated dataset by exploiting the structural properties of weighted $k$NN models,
reducing the complexity to near-linear time in the original data size.
In addition, we establish theoretical foundations for this approach through an axiomatic characterization of the resulting values, 
and empirically validate the effectiveness and efficiency of our method.
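The data-duplication idea can be illustrated with a minimal sketch (not the paper's near-linear algorithm): assuming integer weights, each training point is copied as many times as its weight, exact unweighted $k$NN-Shapley values are computed on the duplicated set via the standard backward recursion, and each copy's value is credited back to its source point. The function names and the quadratic-time duplication step below are illustrative assumptions, not the proposed method.

```python
# Illustrative sketch of weighted kNN-Shapley via naive data duplication.
# Assumes integer weights; this materializes the duplicated dataset, which
# is exactly the cost the paper's algorithm avoids.
import numpy as np

def knn_shapley_unweighted(dists, labels, y_test, k):
    """Exact unweighted kNN-Shapley for a single test point."""
    n = len(dists)
    order = np.argsort(dists)            # sort training points by distance
    match = (labels[order] == y_test).astype(float)
    s = np.zeros(n)
    s[n - 1] = match[n - 1] / n          # value of the farthest point
    for i in range(n - 2, -1, -1):       # backward recursion over the ranking
        s[i] = s[i + 1] + (match[i] - match[i + 1]) / k * min(k, i + 1) / (i + 1)
    out = np.zeros(n)
    out[order] = s                       # map values back to input order
    return out

def weighted_knn_shapley_by_duplication(dists, labels, weights, y_test, k):
    """Copy point i weights[i] times, value the copies, sum per source."""
    src = np.repeat(np.arange(len(dists)), weights)
    s_dup = knn_shapley_unweighted(np.repeat(dists, weights),
                                   np.repeat(labels, weights), y_test, k)
    vals = np.zeros(len(dists))
    np.add.at(vals, src, s_dup)          # credit each copy to its original
    return vals
```

By the efficiency axiom, the values sum to the full-data utility, i.e. the fraction of the $k$ nearest (duplicated) neighbors whose label matches the test label; with all weights equal to one, the weighted values coincide with the unweighted ones.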
Primary Area: General machine learning (supervised, unsupervised, online, active, etc.)
Submission Number: 18506