Keywords: synthetic survey responses, open-ended questions, LLMs
TL;DR: We propose a framework to evaluate how “human-like” LLM-generated open-ended survey responses are, identifying key differences in length, diversity, and specificity, and testing models using both diagnostic metrics and human/LLM classification.
Submission Type: Non-Archival
Abstract: Large language models (LLMs) are now used to prototype survey instruments and even "stand in" for respondents by generating synthetic open-ended answers. Yet benchmarks for judging whether such text is truly respondent-like remain ad hoc. Using 1,024 open-ended answers collected in a 2024 AmeriSpeak survey, we introduce a four-pillar diagnostic (parsimony, heterogeneity, noise, and contextual specificity) that captures hallmarks of authentic, effortful survey responses. We operationalize each pillar with stylometric and semantic metrics and combine them into a composite humanness rating. Synthetic answers from GPT-3.5-Turbo, GPT-4o, Claude-3 Opus, Llama-3-70B, and a lightly fine-tuned GPT-3.5 model are compared against the human benchmark. Interpretable classifiers identify which textual cues most sharply distinguish machine from human prose, while a small Survey Turing Test supplies face-valid confirmation. The study provides a transparent evaluation pipeline and guidance for integrating, or flagging, LLM-generated open-ends in survey workflows, without yet asserting any performance outcomes.
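The sketch below is a purely illustrative reading of the composite rating idea described in the abstract, not the paper's implementation: the pillar metric definitions, scaling to [0, 1], and the averaging scheme are all assumptions made for this example.

```python
# Hypothetical sketch: combining per-pillar diagnostics into a composite
# "humanness" rating. Metric names, weights, and normalization are
# illustrative assumptions, not the method reported in the paper.
from dataclasses import dataclass
from statistics import mean

@dataclass
class PillarScores:
    parsimony: float               # e.g., brevity relative to the human length distribution
    heterogeneity: float           # e.g., lexical/semantic diversity across responses
    noise: float                   # e.g., typo and filler rate typical of effortful human text
    contextual_specificity: float  # e.g., overlap with question-specific content words

def humanness_rating(scores: PillarScores, weights=None) -> float:
    """Combine four pillar scores (each pre-scaled to [0, 1]) into one rating."""
    values = [scores.parsimony, scores.heterogeneity,
              scores.noise, scores.contextual_specificity]
    if weights is None:
        return mean(values)                      # unweighted average
    total = sum(w * v for w, v in zip(weights, values))
    return total / sum(weights)                  # weighted average

# Example: a verbose, repetitive, "too clean", and generic synthetic answer
# would score low on all four pillars and thus receive a low rating.
print(humanness_rating(PillarScores(0.35, 0.40, 0.10, 0.55)))  # -> 0.35
```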
Submission Number: 23