In-Context Learning for the Imputation of Survey Data

Published: 26 Jul 2025, Last Modified: 06 Oct 2025
Venue: NLPOR 2025
License: CC BY 4.0
Keywords: retrieval-augmented generation, multiple imputation, large language models, survey data
Submission Type: Non-Archival
Abstract: We propose a retrieval-augmented generation (RAG) approach for the (multiple) imputation of missing values in survey data using large language models (LLMs). We evaluate several retrieval strategies and compare performance across three missingness mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Using the OpinionQA dataset, we benchmark our approach against statistical imputation methods and find that RAG with Llama-8B struggles with larger contexts and with MNAR missingness. Our main contributions are (1) a thorough simulation study of LLM-based multiple imputation and (2) a Python package with an sklearn-like API and easy integration of local and proprietary LLMs.
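To make the three missingness mechanisms concrete, the following is a minimal sketch of how deletion masks for a simulation study of this kind could be generated. It is an illustration under stated assumptions, not the procedure used in the paper: the function name `missingness_mask`, the binary toy variables, and the deletion probabilities are all hypothetical. Under MCAR every answer is deleted with the same probability; under MAR the probability depends only on a covariate that is assumed to be fully observed; under MNAR it depends on the value that is being deleted.

```python
import random


def missingness_mask(target, covariate, mechanism, rate=0.3, seed=0):
    """Return a list of booleans; True marks a target value to delete.

    Hypothetical helper for illustration only. `target` and `covariate`
    are equal-length lists of 0/1 answers; `covariate` is assumed to be
    fully observed.
    """
    rng = random.Random(seed)
    mask = []
    for y, x in zip(target, covariate):
        if mechanism == "MCAR":
            # Same deletion probability for every respondent.
            p = rate
        elif mechanism == "MAR":
            # Deletion depends only on the observed covariate.
            p = 2 * rate if x == 1 else 0.0
        elif mechanism == "MNAR":
            # Deletion depends on the (to-be-missing) value itself.
            p = 2 * rate if y == 1 else 0.0
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        mask.append(rng.random() < p)
    return mask


# Toy data: two independent binary survey items (an assumption of this sketch).
data_rng = random.Random(42)
covariate = [data_rng.randint(0, 1) for _ in range(1000)]
target = [data_rng.randint(0, 1) for _ in range(1000)]

for mechanism in ("MCAR", "MAR", "MNAR"):
    mask = missingness_mask(target, covariate, mechanism)
    # Share of target answers that would be deleted under this mechanism.
    print(mechanism, sum(mask) / len(mask))
```

In the MNAR case the deleted answers are systematically different from the retained ones, so neither retrieved respondents nor observed covariates carry the information needed to recover them, which is one plausible reading of why this setting is the hardest.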
Submission Number: 35