Keywords: Missing data imputation, Heterogeneous tabular data, Large Language Models, Nearest neighbor, Contrastive learning
TL;DR: We introduce a nearest-neighbor-based imputation method, DrIM, designed for heterogeneous tabular datasets.
Abstract: Missing data poses significant challenges for machine learning and deep learning algorithms. In this paper, we aim to enhance post-imputation performance, measured by machine learning utility (MLu). We introduce a nearest-neighbor-based imputation method, DrIM, designed for heterogeneous tabular datasets. However, calculating similarity in the data space becomes challenging because the pattern of missing entries varies across columns. To address this issue, we leverage the representation learning capabilities of language models. By transforming the tabular dataset into a text-format dataset and replacing the missing entries with mask (or unk) tokens, we extract representations that capture contextual information. This mapping to a continuous representation space enables the use of well-defined similarity measurements. Additionally, we incorporate a contrastive learning framework to refine the representations, ensuring that observations with similar information in their observed columns are closely aligned, regardless of their missingness patterns. To validate our proposed model, we evaluate its performance in missing data imputation across 10 real-world tabular datasets, demonstrating its ability to produce a complete dataset with high MLu.
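The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `[MASK]` string, the "column is value" text template, the helper names (`serialize_row`, `knn_impute`), and the mean/mode aggregation rule are all assumptions made for illustration. The language-model encoder and the contrastive refinement step are omitted; the sketch assumes row embeddings have already been produced by some encoder and only shows the serialization and the nearest-neighbor lookup in representation space.

```python
import numpy as np

MASK = "[MASK]"  # assumed placeholder; the abstract mentions mask (or unk) tokens


def serialize_row(columns, row):
    """Turn one tabular row into text, writing MASK where the entry is missing (None)."""
    parts = []
    for name, value in zip(columns, row):
        shown = MASK if value is None else str(value)
        parts.append(f"{name} is {shown}")
    return ", ".join(parts)


def cosine_similarity_matrix(emb):
    """Pairwise cosine similarity between the rows of an (n, d) array."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def knn_impute(rows, emb, k=1):
    """Fill each missing entry from the k most similar rows that observe that column.

    `emb` holds one embedding per row, assumed to come from an encoder
    applied to the serialized text (the encoder itself is not shown here).
    """
    sim = cosine_similarity_matrix(np.asarray(emb, dtype=float))
    np.fill_diagonal(sim, -np.inf)  # a row is never its own neighbor
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Walk neighbors from most to least similar, keeping those with column j observed.
            donors = [d for d in np.argsort(-sim[i]) if rows[d][j] is not None][:k]
            if not donors:
                continue  # no row observes this column; leave the entry missing
            vals = [rows[d][j] for d in donors]
            if all(isinstance(v, (int, float)) for v in vals):
                imputed[i][j] = float(np.mean(vals))  # numeric column: neighbor mean
            else:
                imputed[i][j] = max(set(vals), key=vals.count)  # categorical: neighbor mode
    return imputed
```

Because similarity is computed between embeddings rather than raw rows, two rows can be compared even when they are missing different columns, which is the difficulty the abstract attributes to working directly in the data space.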
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2811