Nearest-Neighbor Imputation with Error Guarantees and Extensions for Mixed-Type Data and Joint Learning

TMLR Paper9312 Authors

29 May 2026 (modified: 03 Jun 2026) | Under review for TMLR | CC BY 4.0
Abstract: Missing feature values are pervasive in real-world applications and remain a significant hurdle for downstream machine-learning tasks such as classification. Moreover, imputation methods combined with downstream tasks are often time-consuming on high-dimensional data and offer few theoretical guarantees on imputation error, especially under not-missing-at-random mechanisms. First, we show that (weighted) nearest-neighbor approaches remain competitive with the state of the art on real-world data sets while being orders of magnitude faster. Second, we derive a novel concentration inequality from which we obtain theoretically supported bounds on the imputation error of nearest-neighbor algorithms under several types of missingness mechanisms. Third, we show that nearest-neighbor algorithms can be adapted to mixed-type imputation and extended to joint training with downstream tasks by introducing a data-distribution-preserving function and tuning the weights with an online learner. We validate our theoretical bounds on synthetic data sets and report empirical results on nine real-world data sets. This paper demonstrates the strength of nearest-neighbor imputation and opens the way towards more theoretically backed approaches for imputation.
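For readers unfamiliar with the family of methods the abstract refers to, the following is a minimal, generic sketch of (weighted) nearest-neighbor imputation on numeric data. It is not the authors' algorithm: the function name `knn_impute`, the distance (root mean squared difference over co-observed features), and the inverse-distance weights are all assumptions chosen for illustration. Mixed-type data and joint training are not covered.

```python
import numpy as np


def knn_impute(X, k=3, weighted=True, eps=1e-12):
    """Fill NaN entries of X from the k nearest donor rows.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    X_filled = X.copy()

    for i in range(X.shape[0]):
        for j in np.where(~observed[i])[0]:
            # Donors are rows in which feature j is observed.
            donors = np.where(observed[:, j])[0]
            if donors.size == 0:
                continue  # no donor available: leave the entry missing

            # Compare row i to each donor on co-observed features only.
            shared = observed[donors] & observed[i]
            n_shared = shared.sum(axis=1)
            keep = n_shared > 0
            donors, shared, n_shared = donors[keep], shared[keep], n_shared[keep]
            if donors.size == 0:
                continue

            diff = np.where(shared, X[donors] - X[i], 0.0)
            dist = np.sqrt((diff ** 2).sum(axis=1) / n_shared)

            # Keep the k closest donors.
            nearest = np.argsort(dist)[:k]
            if weighted:
                # Assumed weighting: inverse distance (one of many choices).
                weights = 1.0 / (dist[nearest] + eps)
            else:
                weights = np.ones(nearest.size)

            X_filled[i, j] = np.average(X[donors[nearest], j], weights=weights)

    return X_filled


X = np.array([[1.0, 2.0],
              [1.0, np.nan],
              [5.0, 10.0],
              [1.1, 2.2]])
uniform = knn_impute(X, k=2, weighted=False)
inverse = knn_impute(X, k=2, weighted=True)
```

With uniform weights the missing entry is the plain mean of the two closest donors' values; with inverse-distance weights the donor at distance zero dominates the average.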
Submission Type: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=7Wj1rZ7mJ4
Changes Since Last Submission: We now focus on the whole family of (weighted) nearest-neighbor algorithms for imputation, extending our concentration bound and our upper bound on the MSE to any weighted kNN, including uniform-weight and distance-dependent weighted kNNs. We also devote a larger part of the paper to mixed-type imputation for nearest-neighbor algorithms. Our experimental study gives more prominence to the validation of the theoretical bounds, and highlights that nearest-neighbor algorithms are generally faster than, and competitive with, the state of the art for imputation and sometimes for joint imputation-classification. We reran and checked all experiments to ensure that the results match those reported in related papers. We created plots to visualize the experimental results and moved the full numerical tables to the appendix.
Assigned Action Editor: ~Pierre-Alexandre_Mattei3
Submission Number: 9312