In-Context Learning for the Imputation of Survey Data

Tobias Holtdirk; Georg Ahnert; Anna-Carolina Haensch

Published: 26 Jul 2025, Last Modified: 06 Oct 2025

Venue: NLPOR 2025

License: CC BY 4.0

Keywords: retrieval-augmented generation, multiple imputation, large language models, survey data

Submission Type: Non-Archival

Abstract: We propose a Retrieval-Augmented Generation (RAG) approach for (multiple) imputation of missing values in survey data using large language models (LLMs). We evaluate various retrieval strategies and compare performance across missingness mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Using the OpinionQA dataset, we benchmark our approach against statistical imputation methods and find that RAG with Llama-8B struggles with larger contexts and with MNAR missingness. Our main contributions are (1) a thorough simulation study of LLM-based multiple imputation and (2) a Python package with an sklearn-like API and easy integration of local and proprietary LLMs.

Submission Number: 35
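
The three missingness mechanisms named in the abstract differ in what the probability of a missing answer depends on: nothing at all (MCAR), an observed covariate (MAR), or the unobserved answer itself (MNAR). The following is a minimal sketch of how such masks might be simulated on toy data. It is not the authors' simulation code; the function names, the logistic form of the probabilities, and the toy columns `answers` and `age_group` are all assumptions made for illustration.

```python
import math
import random
from statistics import mean


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def missing_probability(y, x, mechanism, rate, y_center, x_center):
    """Probability that the target value y is masked (hypothetical helper).

    MCAR: constant, independent of the data.
    MAR:  depends only on the fully observed covariate x.
    MNAR: depends on the (to-be-masked) target value y itself.
    """
    if mechanism == "MCAR":
        return rate
    if mechanism == "MAR":
        return 2.0 * rate * _sigmoid(x - x_center)
    if mechanism == "MNAR":
        return 2.0 * rate * _sigmoid(y - y_center)
    raise ValueError(f"unknown mechanism: {mechanism!r}")


def simulate_mask(target, covariate, mechanism, rate=0.3, seed=0):
    """Return a list of booleans, True where the target value is masked."""
    if not 0.0 <= rate <= 0.5:
        # Keeps 2 * rate * sigmoid(...) inside [0, 1].
        raise ValueError("rate must lie in [0, 0.5]")
    rng = random.Random(seed)
    y_center, x_center = mean(target), mean(covariate)
    return [
        rng.random() < missing_probability(y, x, mechanism, rate, y_center, x_center)
        for y, x in zip(target, covariate)
    ]


# Toy data (assumed): Likert-style answers and an observed covariate.
answers = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
age_group = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]

for mechanism in ("MCAR", "MAR", "MNAR"):
    mask = simulate_mask(answers, age_group, mechanism)
    print(mechanism, sum(mask), "of", len(mask), "answers masked")
```

Under MNAR in this sketch, high answers are more likely to be masked than low ones, so the observed answers are a biased sample and no observed covariate explains the gap. This is the usual reason MNAR is the hardest case for any imputation method.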
