Beyond Accuracy: A Replication Fidelity Framework for Trustworthy LLM Evaluation in Social Science Applications

Published: 26 Jul 2025 · Last Modified: 06 Oct 2025 · NLPOR 2025 · CC BY 4.0
Keywords: LLM evaluation, demographic bias, trustworthy AI, bias detection, social science applications, synthetic participants, replication fidelity, vignette experiments, human-AI evaluation, cross-cultural bias
TL;DR: We show that trustworthy LLM evaluation must test whether models reproduce human disagreement patterns, not just hit accuracy targets, and use a novel Replication Fidelity Index to uncover systematic demographic biases across all major models.
Submission Type: Non-Archival
Abstract: Current LLM evaluation approaches do not always detect the systematic biases that undermine trustworthy deployment in social science applications. Using family-ideal vignettes from established surveys in China (n=5,186) and the United States (n=5,906), we systematically compared five state-of-the-art LLMs against human responses using a novel Replication Fidelity Index (RFI). Our counterintuitive finding is that successful replication requires capturing human disagreement and variability rather than just central tendencies, a requirement that standard accuracy metrics fail to measure. All models exhibited systematic demographic biases: the responses of married individuals were consistently under-predicted across all LLMs, with 9 biases remaining statistically significant after Bonferroni correction. We introduce the RFI as a comprehensive evaluation framework that decomposes model performance into magnitude accuracy, direction consistency, pattern preservation, and scale calibration. These findings illuminate critical blind spots in current evaluation practices and provide an actionable framework for bias-aware assessment of LLMs in social research applications.
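The abstract names the four RFI components but does not give their formulas or weighting. The sketch below is one plausible construction, not the paper's actual method: it assumes each component is scored in [0, 1] against human response distributions on a bounded rating scale and that the composite is their unweighted mean. The function name `replication_fidelity_index`, the `scale_range` parameter, and every component definition are illustrative assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

def replication_fidelity_index(human, model, scale_range=6.0):
    """Illustrative RFI sketch: four [0, 1] sub-scores averaged into one index.

    `human` and `model` map subgroup labels (e.g. vignette conditions) to
    arrays of responses on a common rating scale whose width is
    `scale_range` (e.g. 6.0 for a 1-7 scale). All component definitions
    below are assumptions, not the paper's exact formulas.
    """
    groups = sorted(human)
    h_means = np.array([np.mean(human[g]) for g in groups])
    m_means = np.array([np.mean(model[g]) for g in groups])

    # 1. Magnitude accuracy: 1 minus normalized mean absolute error of means.
    magnitude = 1.0 - np.mean(np.abs(h_means - m_means)) / scale_range

    # 2. Direction consistency: share of pairwise subgroup contrasts where
    #    the model reproduces the sign of the human difference.
    signs = [
        np.sign(h_means[i] - h_means[j]) == np.sign(m_means[i] - m_means[j])
        for i in range(len(groups)) for j in range(i + 1, len(groups))
    ]
    direction = float(np.mean(signs)) if signs else 1.0

    # 3. Pattern preservation: rank correlation of subgroup means,
    #    rescaled from [-1, 1] to [0, 1].
    rho, _ = spearmanr(h_means, m_means)
    pattern = (rho + 1.0) / 2.0

    # 4. Scale calibration: ratio of average within-group dispersions,
    #    capped at 1 so over- and under-dispersion are both penalized.
    h_sd = np.mean([np.std(human[g]) for g in groups])
    m_sd = np.mean([np.std(model[g]) for g in groups])
    calibration = min(h_sd, m_sd) / max(h_sd, m_sd) if max(h_sd, m_sd) > 0 else 1.0

    return float(np.mean([magnitude, direction, pattern, calibration]))
```

Note that under this reading, component 4 scores the model's reproduction of within-group dispersion, which is where the abstract's central point would surface: a model that matches every subgroup mean but collapses human variability would still lose fidelity.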
Submission Number: 18