Think Twice Before Imputation: Optimizing Data Imputation Order for Machine Learning

Published: 2025, Last Modified: 07 Jan 2026, Venue: ICDE 2025, License: CC BY-SA 4.0
Abstract: Data imputation (DI) is a common means of improving data quality. With the rise of machine learning (ML), a class of imputation methods that take the downstream model into account during imputation has been proposed, denoted DI for ML. A critical challenge in this setting is determining the optimal order in which to impute a set of incomplete samples. To address this, we propose an iterative approach that determines the imputation order based on each sample's potential impact on model performance. First, we design an impact score, computed in a what-if manner, to evaluate the significance of each incomplete data point for the downstream ML model. Second, to cope with the scarcity of reliable complete data in real-world scenarios, we leverage meta-learning to make the impact score computation more robust. Finally, to avoid converging to local optima and selecting non-diverse data during iterative imputation, we introduce a real-time feedback strategy based on the Multi-Armed Bandit mechanism. By balancing immediate rewards against long-term gains, our approach navigates the complex optimization landscape and leads to globally optimal imputation orders. We experimentally validate our method on eight real-world datasets and five types of ML models; the results indicate that the imputation order optimized by our method outperforms current state-of-the-art methods.
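To make the what-if idea concrete, here is a minimal Python sketch of an impact score combined with a purely greedy ordering loop. It is an illustration under loud assumptions, not the paper's method: the one-feature least-squares model, the mean imputer, and the names `fit_line`, `impact_score`, `mean_impute`, and `greedy_order` are all hypothetical, and the meta-learning and Multi-Armed Bandit components described in the abstract are deliberately left out.

```python
from statistics import mean

def fit_line(points):
    """Least-squares fit y = a*x + b over (x, y) pairs (toy downstream model)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in points) / var if var else 0.0
    return a, my - a * mx

def val_loss(model, val):
    """Mean squared error of the fitted line on a validation set."""
    a, b = model
    return mean((a * x + b - y) ** 2 for x, y in val)

def impact_score(train, candidate, val):
    """What-if score: drop in validation loss if `candidate` joined the training set."""
    base = val_loss(fit_line(train), val)
    what_if = val_loss(fit_line(train + [candidate]), val)
    return base - what_if

def mean_impute(train, sample):
    """Hypothetical imputer: fill a missing feature with the current training mean."""
    x, y = sample
    if x is None:
        x = mean(px for px, _ in train)
    return (x, y)

def greedy_order(train, incomplete, val, impute):
    """Repeatedly impute the sample with the highest what-if impact score.

    Assumption: pure greedy selection. The abstract instead describes a
    bandit-based feedback strategy to avoid local optima.
    """
    train, remaining, order = list(train), dict(incomplete), []
    while remaining:
        # Impute every remaining sample against the *current* training set.
        imputed = {sid: impute(train, s) for sid, s in remaining.items()}
        best = max(imputed, key=lambda sid: impact_score(train, imputed[sid], val))
        train.append(imputed[best])  # later scores see this imputed sample
        order.append(best)
        del remaining[best]
    return order

if __name__ == "__main__":
    train = [(0, 0), (1, 1)]
    val = [(0, 0), (1, 2), (2, 4)]
    incomplete = {"s1": (None, 4), "s2": (None, 0)}  # feature missing, label known
    print(greedy_order(train, incomplete, val, mean_impute))
```

Because each imputed sample is appended to the training set before the next round, both the imputed values and the scores of the remaining samples depend on what was imputed earlier, which is why the order matters in this toy setting.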