Keywords: Data valuation, Shapley values, Nearest Neighbors
TL;DR: We propose a new method to compute exact Shapley values for weighted k-Nearest Neighbors in near-linear time.
Abstract: Data valuation quantifies the impact of individual data points on model performance, and 
Shapley values provide a principled approach to this task due to their desirable axiomatic properties, 
albeit at high computational cost. 
Recent breakthroughs have enabled fast computation of exact Shapley values for unweighted $k$-nearest neighbor ($k$NN) classifiers. 
However, extending this to weighted $k$NN models has remained a significant open challenge. 
State-of-the-art methods either require quadratic time or resort to sampling-based approximation.
In this paper, we show that a conceptually simple but overlooked approach --- data duplication --- can be applied to this problem, yielding a natural variant of weighted $k$NN-Shapley.
However, a straightforward application of data duplication 
inflates the dataset, incurring prohibitive computational and memory costs.
We develop an efficient algorithm that avoids materializing the duplicated dataset by exploiting the structural properties of weighted $k$NN models,
reducing the complexity to near-linear time in the original data size.
In addition, we establish theoretical foundations for this approach through an axiomatic characterization of the resulting values, 
and empirically validate the effectiveness and efficiency of our method.
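The data-duplication idea can be illustrated with a minimal sketch (not the paper's near-linear algorithm): assuming integer weights, each training point is copied as many times as its weight, exact unweighted $k$NN-Shapley values are computed on the duplicated set via the standard backward recursion, and each copy's value is credited back to its source point. The function names and the quadratic-time duplication step below are illustrative assumptions, not the proposed method.

```python
# Illustrative sketch of weighted kNN-Shapley via naive data duplication.
# Assumes integer weights; this materializes the duplicated dataset, which
# is exactly the cost the paper's algorithm avoids.
import numpy as np

def knn_shapley_unweighted(dists, labels, y_test, k):
    """Exact unweighted kNN-Shapley for a single test point."""
    n = len(dists)
    order = np.argsort(dists)            # sort training points by distance
    match = (labels[order] == y_test).astype(float)
    s = np.zeros(n)
    s[n - 1] = match[n - 1] / n          # value of the farthest point
    for i in range(n - 2, -1, -1):       # backward recursion over the ranking
        s[i] = s[i + 1] + (match[i] - match[i + 1]) / k * min(k, i + 1) / (i + 1)
    out = np.zeros(n)
    out[order] = s                       # map values back to input order
    return out

def weighted_knn_shapley_by_duplication(dists, labels, weights, y_test, k):
    """Copy point i weights[i] times, value the copies, sum per source."""
    src = np.repeat(np.arange(len(dists)), weights)
    s_dup = knn_shapley_unweighted(np.repeat(dists, weights),
                                   np.repeat(labels, weights), y_test, k)
    vals = np.zeros(len(dists))
    np.add.at(vals, src, s_dup)          # credit each copy to its original
    return vals
```

By the efficiency axiom, the values sum to the full-data utility, i.e. the fraction of the $k$ nearest (duplicated) neighbors whose label matches the test label; with all weights equal to one, the weighted values coincide with the unweighted ones.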
Primary Area: General machine learning (supervised, unsupervised, online, active, etc.)
Submission Number: 18506