Benchmarking Missing Data Imputation Methods in Socioeconomic Surveys

TMLR Paper6286 Authors

23 Oct 2025 (modified: 12 Jan 2026) | Decision pending for TMLR | CC BY 4.0
Abstract: Missing data imputation is a core challenge in socioeconomic surveys, where data are often longitudinal, hierarchical, high-dimensional, not independent and identically distributed, and missing under complex mechanisms. Socioeconomic datasets like the Consumer Pyramids Household Survey (CPHS), the largest continuous household survey in India since 2014, covering 174,000 households, highlight the importance of robust imputation, which can reduce survey costs, preserve statistical power, and enable timely policy analysis. This paper systematically evaluates imputation methods, both classical machine learning and deep learning, under three missingness mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR), across five missingness ratios ranging from 10% to 50%. We evaluate imputation performance on both continuous and categorical variables, assess the impact on downstream tasks, and compare the computational efficiency of each method. Our results indicate that classical machine learning methods such as MissForest and HyperImpute remain strong baselines with favorable trade-offs between accuracy and efficiency, while deep learning methods perform better under complex missingness patterns and higher missingness ratios but face scalability challenges. We ran experiments on CPHS and multiple synthetic survey datasets and found consistent patterns across them. Our framework aims to provide a reliable benchmark for structured socioeconomic surveys and addresses the critical gap in reproducible, domain-specific evaluation of imputation methods. The open-source code is provided.
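To make the three missingness mechanisms named in the abstract concrete, the following is a minimal sketch of how masks could be generated for one target column. It is not the paper's implementation: the function names (`make_mask`, `_calibrate_intercept`), the unit slope of the logistic model, and the bisection used to hit the requested ratio are all assumptions for illustration. MCAR drops values with a constant probability, MAR ties the probability to a different, fully observed column, and MNAR ties it to the value that goes missing.

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def _calibrate_intercept(score, ratio, iters=60):
    """Bisection on the intercept so the mean missing probability is close to `ratio`."""
    lo, hi = -50.0, 50.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if _sigmoid(score + mid).mean() < ratio:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def make_mask(X, target, ratio, mechanism, driver=None, seed=0):
    """Return a boolean vector; True marks rows where column `target` is dropped.

    Hypothetical helper, not taken from the paper's codebase.
    MCAR: constant probability `ratio`.
    MAR:  logistic in an observed column `driver` (must differ from `target`).
    MNAR: logistic in the target column itself.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    if mechanism == "MCAR":
        p = np.full(n, ratio)
    elif mechanism in ("MAR", "MNAR"):
        if mechanism == "MAR" and driver is None:
            raise ValueError("MAR needs an observed driver column")
        col = driver if mechanism == "MAR" else target
        x = X[:, col]
        # Standardize so the assumed unit slope has a comparable effect across columns.
        score = (x - x.mean()) / (x.std() + 1e-12)
        p = _sigmoid(score + _calibrate_intercept(score, ratio))
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    return rng.random(n) < p
```

With a positive slope, as assumed here, larger values are more likely to be dropped under MNAR, so the observed part of the target column is biased toward smaller values.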
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission:

* **Revised Table 1 and Appendix A.4** to enhance the benchmark dataset comparison with explicit column definitions and structural grouping, moving detailed dataset contexts to the appendix for improved clarity.
* **Revised Section 4.3** to move the formal mathematical definitions of the logistic missingness generation mechanisms (MAR and MNAR) from the appendix to the main text to improve transparency.
* **Revised Section 5.3 and Section 6.5** to extend the consistency analysis to include ranking stability across four different downstream models (Logistic/Linear Regression, Random Forest, XGBoost, and LightGBM), supported by Kendall's $W$ statistics.
* **Revised Section 6.2** to add a "Failure Analysis" subsection explicitly discussing the limitations and instability causes for MIRACLE and MOT methods.
* **Revised Section 6.2 and added Appendix A.6** to include a detailed empirical analysis validating the "tail censoring" effect of the logistic missingness mechanism, explaining the counterintuitive trend where RMSE for mean imputation decreases as missingness ratios increase.
* **Added Section 6.7** to provide "Practical Recommendations," offering actionable guidelines for practitioners on selecting imputation methods based on missingness ratios and variable types.
* **Revised Appendix A.2** to clarify the reproducibility scope, specifying that full reproduction is supported for the public SubSDIC dataset via the provided codebase and tutorial.
* **Added Appendix A.7** to present a rigorous temporal dependency analysis using autocorrelation functions (ACF) to validate the longitudinal structure of the data and its preservation under missingness.
* **Added Appendix A.8** to include a hyperparameter sensitivity analysis (using GAIN as a case study) and convergence verification to justify the use of default settings in the benchmark.
* **Added Appendix A.9.1** to report statistical significance tests (paired t-tests) for key performance comparisons on the SubSDIC dataset to robustly validate the rankings of top-performing methods.
* **Added Appendix A.9.2** to evaluate LightGBM's native missing value handling as a baseline, demonstrating that explicit imputation consistently outperforms "no-imputation" strategies.
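For readers unfamiliar with the Kendall's $W$ statistic mentioned in the change list, the sketch below shows one way it could be computed for ranking stability. It assumes a score matrix with one row per downstream model and one column per imputation method, and it assumes no tied scores within a row (no tie correction is applied). The function name `kendalls_w` and the matrix layout are illustrative choices, not taken from the paper's codebase.

```python
import numpy as np

def kendalls_w(scores):
    """Kendall's coefficient of concordance, without tie correction.

    scores: array of shape (m, n), m raters (e.g. downstream models)
            scoring n items (e.g. imputation methods).
    Returns W in [0, 1]; 1 means every rater ranks the items identically.
    """
    scores = np.asarray(scores, dtype=float)
    m, n = scores.shape
    # Double argsort converts each row of scores to ranks 1..n (assumes no ties).
    ranks = scores.argsort(axis=1).argsort(axis=1) + 1
    rank_sums = ranks.sum(axis=0)
    # S: squared deviations of the per-item rank sums from their mean.
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12.0 * s / (m ** 2 * (n ** 3 - n))
```

A value near 1 would indicate that the ordering of imputation methods is largely the same regardless of which downstream model is used; a value near 0 would indicate that the models disagree on the ordering.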
Assigned Action Editor: ~Fredrik_Daniel_Johansson1
Submission Number: 6286