A Context-Aware Approach for Enhancing Data Imputation with Pre-trained Language Models

ACL ARR 2024 June Submission2820 Authors

15 Jun 2024 (modified: 12 Jul 2024) · CC BY 4.0
Abstract: This paper presents Contextually Relevant Imputation leveraging pre-trained Language Models (CRILM), a novel approach for handling missing data in tabular datasets that complements existing numeric-estimation methods. Instead of relying on traditional numerical estimates, CRILM uses pre-trained language models (LMs) to create contextually relevant descriptors for missing values. This method aligns datasets with LMs' strengths: large LMs generate the descriptors, and small LMs are fine-tuned on the enriched datasets for improved downstream task performance. Our evaluations demonstrate CRILM's superior performance and robustness across missing-completely-at-random (MCAR), missing-at-random (MAR), and the more challenging missing-not-at-random (MNAR) scenarios, with up to a 10% improvement over the best-performing baselines. By mitigating biases, particularly in MNAR settings, CRILM improves downstream task performance and offers a cost-effective solution for resource-constrained environments.
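To make the described pipeline concrete, the following is a minimal sketch of the idea in the abstract, not the authors' released implementation: missing cells are replaced with contextually relevant text descriptors (here stubbed with a fixed template where the paper would use a large LM), and rows are verbalized into natural-language strings suitable for fine-tuning a small LM. The helper names `descriptor_for` and `verbalize` are illustrative assumptions.

```python
# Hedged sketch of a CRILM-style imputation step (assumptions labeled below).
import pandas as pd

def descriptor_for(column: str) -> str:
    # Assumption: in CRILM, a large LM would generate a contextually
    # relevant descriptor for the missing value in `column`; this fixed
    # template merely stands in for that generation step.
    return f"the {column} value is not recorded"

def verbalize(df: pd.DataFrame) -> list[str]:
    # Turn each row into a natural-language string, substituting
    # descriptors for missing entries instead of numeric estimates.
    texts = []
    for _, row in df.iterrows():
        parts = []
        for col, val in row.items():
            if pd.isna(val):
                parts.append(descriptor_for(col))
            else:
                parts.append(f"the {col} is {val}")
        texts.append(", ".join(parts))
    return texts

# Toy example with one missing value.
df = pd.DataFrame({"age": [34, None], "income": [52000, 61000]})
for text in verbalize(df):
    print(text)
# The verbalized texts would then serve as fine-tuning inputs for a
# small LM on the downstream task, per the abstract.
```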
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: language model, missing values, imputation, context-aware
Contribution Types: Publicly available software and/or pre-trained models, Data analysis
Languages Studied: None
Submission Number: 2820