Context-Driven Missing Data Imputation via Large Language Model

Jaesung Lim; Seunghwan An; Gyeongdong Woo; ChangHyun Kim; Jong-June Jeon

23 Sept 2024 (modified: 05 Feb 2025) | Submitted to ICLR 2025 | Readers: Everyone | License: CC BY 4.0

Keywords: Missing data imputation, Heterogeneous tabular data, Large Language Models, Nearest neighbor, Contrastive learning

TL;DR: We introduce a nearest-neighbor-based imputation method, DrIM, designed for heterogeneous tabular datasets.

Abstract: Missing data poses significant challenges for machine learning and deep learning algorithms. In this paper, we aim to improve post-imputation performance, measured by machine learning utility (MLu). We introduce DrIM, a nearest-neighbor-based imputation method designed for heterogeneous tabular datasets. Nearest-neighbor imputation requires a similarity measure between observations, but computing similarity in the data space is difficult because the set of missing columns varies from one observation to another. To address this issue, we leverage the representation learning capabilities of language models. By transforming the tabular dataset into a text-format dataset and replacing the missing entries with mask (or unk) tokens, we extract representations that capture contextual information. Mapping observations to a continuous representation space enables the use of well-defined similarity measures. Additionally, we incorporate a contrastive learning framework to refine the representations so that observations with similar information in their observed columns are closely aligned, regardless of their missingness patterns. To validate the proposed model, we evaluate its imputation performance on 10 real-world tabular datasets and demonstrate its ability to produce a completed dataset with high MLu.
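
The pipeline outlined in the abstract (serialize each row to text, mask the missing entries, embed, then impute from nearest neighbors) can be illustrated with a minimal sketch. This is not the authors' implementation. The "column is value" template, the `toy_embed` function (a bag-of-phrases stand-in for a language-model encoder), the cosine similarity, and the mean/majority-vote aggregation are all assumptions made for illustration, and the contrastive refinement step is omitted.

```python
import math
from collections import Counter

MASK = "[MASK]"  # placeholder token for a missing entry (assumed spelling)


def serialize(row, columns):
    """Turn one table row into text, writing MASK for missing (None) entries."""
    return ", ".join(
        f"{c} is {MASK if row[c] is None else row[c]}" for c in columns
    )


def toy_embed(text):
    """Hypothetical stand-in for a language-model encoder.

    Returns a sparse bag of the observed "column is value" phrases;
    masked phrases contribute nothing to the representation.
    """
    return Counter(p for p in text.split(", ") if not p.endswith(MASK))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def impute(rows, columns, k=2):
    """Fill each missing entry from the k most similar rows that observe it.

    Assumed aggregation: mean for numeric values, majority vote otherwise.
    """
    embs = [toy_embed(serialize(r, columns)) for r in rows]
    completed = [dict(r) for r in rows]  # leave the input untouched
    for i, row in enumerate(rows):
        for c in columns:
            if row[c] is not None:
                continue
            # Candidate donors: other rows where column c is observed.
            donors = [j for j, r in enumerate(rows) if j != i and r[c] is not None]
            donors.sort(key=lambda j: cosine(embs[i], embs[j]), reverse=True)
            values = [rows[j][c] for j in donors[:k]]
            if not values:
                continue  # nothing to borrow from; keep the gap
            if all(isinstance(v, (int, float)) for v in values):
                completed[i][c] = sum(values) / len(values)
            else:
                completed[i][c] = Counter(values).most_common(1)[0][0]
    return completed


columns = ["job", "city", "income"]
rows = [
    {"job": "nurse", "city": "Seoul", "income": 40},
    {"job": "nurse", "city": "Seoul", "income": 44},
    {"job": "pilot", "city": "Busan", "income": 90},
    {"job": "pilot", "city": "Busan", "income": 95},
    {"job": "nurse", "city": "Seoul", "income": None},
    {"job": "pilot", "city": None, "income": 90},
]
completed = impute(rows, columns, k=2)
print(serialize(rows[4], columns))
print(completed[4]["income"], completed[5]["city"])
```

Because similarity is computed on the embeddings rather than on raw columns, two rows with different missingness patterns can still be compared. In practice the toy encoder would be replaced by a pretrained language model whose representations are refined as the abstract describes.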

Supplementary Material: zip

Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 2811
