Can LLMs Imitate Social Media Dialogue? Techniques for Calibration and a BERT-Based Turing Test

Published: 24 Jul 2025, Last Modified: 24 Jul 2025 · Social Sim'25 · CC BY 4.0
Keywords: GABMs, social media simulations, LLMs, calibration, validation
TL;DR: We introduce a BERT-based evaluation framework to assess how human-like LLM-generated social media replies are, revealing persistent stylistic and affective gaps but also clear gains from fine-tuning and stylistic conditioning.
Abstract: Large language models (LLMs) are increasingly used to simulate human behavior in online environments, yet existing evaluation methods, e.g., simplified Turing tests with human annotators, fall short of capturing the subtle stylistic and affective features that distinguish human- from AI-generated text. In this study, we introduce a human-likeness evaluation framework that systematically quantifies how closely LLM-generated social media replies resemble those written by real users. Our framework leverages a suite of interpretable textual features capturing stylistic, tonal, and emotional dimensions of online conversation. We apply this framework to evaluate five commonly used open-weight LLMs across a variety of generation configurations, including fine-tuning, stylistic few-shot prompting, and context retrieval. To benchmark and enhance realism, we incorporate a machine learning–based judge that ranks candidate AI responses according to their similarity to human replies. Our results reveal persistent divergences between human and LLM-generated replies, especially in affective and stylistic dimensions. Nonetheless, we identify clear gains in realism from stylistic conditioning, context-aware prompting, and fine-tuning, with models such as Gemma, Llama, and Mistral performing best.
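The abstract describes the judge only at a high level, so the following is a minimal illustrative sketch, not the authors' implementation: a BERT-style judge that ranks candidate LLM replies by embedding similarity to a pool of human-written replies. The encoder choice (bert-base-uncased), mean pooling, and cosine similarity to a human-reply centroid are all assumptions standing in for whatever configuration the paper actually uses.

    # Hypothetical sketch of a BERT-based judge: rank candidate AI replies by
    # embedding similarity to human-written replies. Model, pooling, and metric
    # are illustrative assumptions, not the paper's exact setup.
    import torch
    from transformers import AutoTokenizer, AutoModel

    MODEL_NAME = "bert-base-uncased"  # assumption: any BERT-family encoder
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModel.from_pretrained(MODEL_NAME)
    model.eval()

    def embed(texts):
        """Mean-pooled BERT embeddings for a list of strings."""
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**batch).last_hidden_state      # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1)       # (B, T, 1)
        summed = (hidden * mask).sum(dim=1)                # ignore padding tokens
        return summed / mask.sum(dim=1)                    # (B, H)

    def rank_candidates(candidates, human_replies):
        """Rank candidates by cosine similarity to the human-reply centroid."""
        cand_emb = torch.nn.functional.normalize(embed(candidates), dim=-1)
        centroid = torch.nn.functional.normalize(
            embed(human_replies).mean(dim=0, keepdim=True), dim=-1)
        scores = (cand_emb @ centroid.T).squeeze(-1)       # cosine similarity
        order = scores.argsort(descending=True)
        return [(candidates[i], scores[i].item()) for i in order]

    # Example: pick the most human-like of three candidate replies.
    humans = ["lol same, mondays are rough", "nah that's wild, no way"]
    cands = ["That is indeed a remarkable observation.",
             "haha fr, can't deal with this",
             "As an AI, I find this topic fascinating."]
    for reply, score in rank_candidates(cands, humans):
        print(f"{score:.3f}  {reply}")

In this framing, a candidate reply counts as more "human-like" the closer it sits to the distribution of real replies in embedding space; the paper additionally grounds its judgments in interpretable stylistic, tonal, and emotional features, which this sketch omits.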
Submission Number: 19