Improving Generalization for Missing Data Imputation via Dual Corruption Denoising Autoencoders

21 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: missing data, imputation, denoising autoencoder, deep learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Missing data poses challenges for machine learning applications across domains. Prevalent deep-learning imputation techniques have demonstrated limitations: GANs exhibit training instability, while AutoEncoders tend to overfit. In real-world applications, missingness arises in diverse patterns and at varied rates, calling for an accurate and generic imputation approach. In this paper, we introduce Dual Corruption Denoising AutoEncoders (DC-DAE), which 1) augment inputs via dual corruptions (i.e., concurrent masking and additive noise) during training, preventing reliance on fixed missingness patterns and enabling improved generalization; and 2) apply a balanced loss function, allowing control over reconstructing artificial missingness versus denoising observed values. DC-DAE has a simple yet effective architecture without the complexity of attention mechanisms or adversarial training. By combining corruption robustness with high-fidelity reconstruction, DC-DAE achieves both accuracy and stability. We demonstrate state-of-the-art performance on multiple tabular datasets with different missing rates, outperforming GAN, DAE, and VAE baselines under varied missingness scenarios. Our results highlight the importance of diverse and well-chosen corruptions when designing models for imputation. The proposed plug-and-play approach offers an effective solution for ubiquitous missing data problems across domains.
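The two ingredients named in the abstract — dual corruption (concurrent masking plus additive noise) and a balanced loss weighting artificial-missingness reconstruction against denoising of observed values — can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation; the function names, the zero-fill for masked entries, and the hyperparameters (`mask_rate`, `noise_std`, `alpha`) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dual_corrupt(x, mask_rate=0.2, noise_std=0.1, rng=rng):
    """Apply both corruptions at once: artificially mask a random subset
    of entries and add Gaussian noise to the rest.
    (Illustrative defaults; the paper's settings are not given here.)"""
    artificial_mask = rng.random(x.shape) < mask_rate      # entries hidden from the model
    noise = rng.normal(0.0, noise_std, size=x.shape)
    x_corrupt = np.where(artificial_mask, 0.0, x + noise)  # zero-fill masked, noise elsewhere
    return x_corrupt, artificial_mask

def balanced_loss(x_true, x_recon, artificial_mask, alpha=0.5):
    """Balanced objective: alpha weights reconstruction of artificially
    masked entries against denoising of the observed (noised) entries."""
    masked = (x_recon - x_true)[artificial_mask]
    observed = (x_recon - x_true)[~artificial_mask]
    masked_err = np.mean(masked ** 2) if masked.size else 0.0
    observed_err = np.mean(observed ** 2) if observed.size else 0.0
    return alpha * masked_err + (1.0 - alpha) * observed_err

# Toy usage: corrupt a small batch, then score a (trivial) reconstruction.
x = rng.normal(size=(8, 4))
x_corrupt, mask = dual_corrupt(x)
loss = balanced_loss(x, x_corrupt, mask)
```

In a full training loop, `x_corrupt` would be fed through the autoencoder and `balanced_loss` computed on its output; varying `alpha` trades off the two reconstruction objectives the abstract describes.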
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3055