AIn’t Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation
Keywords: survey research, open-ended responses, LLMs, text classification, prompt engineering
TL;DR: The usability of LLMs for coding open-ended responses depends on the LLM and the prompting approach used; only fine-tuning achieves satisfactory performance in a niche language and topic application.
Submission Type: Non-Archival
Abstract: The recent development and wider accessibility of large language models (LLMs) have spurred discussions about how these language models can be used in survey research, including classifying open-ended survey responses. Given their linguistic capacities, LLMs may be an efficient alternative to time-consuming manual coding and the pre-training of supervised machine learning models. As most existing research on this topic has focused on English-language responses relating to non-complex topics or on single LLMs, it is unclear whether its findings generalize and how the quality of these classifications compares to established methods. In this study, we investigate to what extent different LLMs can be used to code open-ended survey responses in other contexts, using German data on reasons for survey participation as an example. We compare several state-of-the-art LLMs of the GPT, Llama, and Mistral families and several prompting approaches, including zero- and few-shot prompting and fine-tuning, and evaluate the LLMs' performance against human expert codings. Overall performance differs greatly between LLMs, and only a fine-tuned LLM achieves satisfactory levels of predictive performance. Performance differences between prompting approaches are conditional on the LLM used. Moreover, the LLMs' unequal classification performance across different categories of reasons for survey participation results in different categorical distributions when not using fine-tuning. We discuss the implications of these findings for methodological research on coding open-ended responses, for their substantive analysis, and for practitioners processing or analyzing such data. Finally, we highlight the many trade-offs researchers need to consider when choosing automated methods for open-ended response classification in the age of LLMs. In doing so, our study contributes to the growing body of research about the conditions under which LLMs can be efficiently, accurately, and reliably leveraged in survey research and their impact on data quality. [This paper is awaiting publication in Survey Research Methods.]
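To make the abstract's setup concrete, the sketch below illustrates what zero-/few-shot prompting for coding a German open-ended response on survey motivation, followed by evaluation against human expert codes, might look like. It is not the authors' actual pipeline: the model name, category scheme, prompts, and example responses are assumptions chosen for illustration, and only the study compares GPT, Llama, and Mistral models as well as fine-tuning.

```python
"""
Illustrative sketch (not the authors' pipeline): zero-/few-shot prompting with the
OpenAI Chat Completions API to code German open-ended responses on reasons for
survey participation, plus a toy evaluation against human expert codes.
Model name, categories, and prompts are assumptions for illustration only.
"""
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score, f1_score

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical category scheme; the paper's actual codebook may differ.
CATEGORIES = ["interest in topic", "incentive", "helping research", "other"]

SYSTEM_PROMPT = (
    "You code German open-ended survey responses about reasons for survey "
    f"participation. Answer with exactly one category from: {', '.join(CATEGORIES)}."
)

# Optional few-shot examples (invented for illustration).
FEW_SHOT = [
    {"role": "user", "content": "Ich fand das Thema der Umfrage spannend."},
    {"role": "assistant", "content": "interest in topic"},
    {"role": "user", "content": "Wegen des Gutscheins, den man bekommt."},
    {"role": "assistant", "content": "incentive"},
]

def code_response(text: str, few_shot: bool = False) -> str:
    """Return the predicted category for one open-ended response."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    if few_shot:
        messages += FEW_SHOT
    messages.append({"role": "user", "content": text})
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the study compares several LLM families
        messages=messages,
        temperature=0,        # keep the coding output as deterministic as possible
    )
    return completion.choices[0].message.content.strip().lower()

if __name__ == "__main__":
    responses = [
        "Ich wollte die Forschung unterstützen.",
        "Mich hat das Thema interessiert.",
    ]
    human_codes = ["helping research", "interest in topic"]  # expert gold standard
    llm_codes = [code_response(r, few_shot=True) for r in responses]

    # Agreement between LLM codes and human expert codings.
    print("Cohen's kappa:", cohen_kappa_score(human_codes, llm_codes))
    print("Macro F1:", f1_score(human_codes, llm_codes, average="macro"))
```

In the study itself, fine-tuning (rather than the zero- or few-shot prompting sketched here) is what reaches satisfactory predictive performance, so such a prompting setup would serve mainly as a baseline.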
Submission Number: 37