Tabular data imputation: quality over quantity

16 May 2022 (modified: 05 May 2023) | NeurIPS 2022 Submitted | Readers: Everyone
Keywords: data imputation, density estimation, nearest neighbors, likelihood, multimodality
TL;DR: We introduce kNNxKDE: a new tabular data imputation tool which favors quality imputation results over minimizing the RMSE.
Abstract: Tabular data imputation algorithms estimate missing values so that incomplete numerical datasets can be used. Current imputation methods minimize the error between the unobserved ground truth and the imputed values. We show that this strategy has major drawbacks in the presence of multimodal distributions, and we propose a qualitative approach in place of the current quantitative one. We introduce the kNNxKDE algorithm: a hybrid method that uses chosen neighbors ($k$NN) for conditional density estimation (KDE), tailored for data imputation. We show both qualitatively and quantitatively that our method preserves the original data structure when performing imputation. This work advocates for a careful and reasonable use of statistics and machine learning models by data practitioners.
Supplementary Material: zip
