"Construct Validity" in LLMs: Metrics to Measure Consistency & Alignment in Multi-Turn Likert and Free-Text Scenarios

ACL ARR 2026 January Submission1153 Authors

28 Dec 2025 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: LLM psychometrics, multi-turn evaluation, construct validity
Abstract: As LLMs increasingly serve as human simulacra in social science, evaluating their behavioral consistency becomes critical. Current assessments typically rely on single-turn interactions, which ignore sequential dependencies. We address this gap by introducing two novel evaluation techniques: Multi-Turn Decision Tracing (MTDT), which maps similarity in Likert-scale responses across branching decision paths by tracking probability distributions, and Multi-Turn Reward Consistency (MTRC), which assesses alignment stability in free-text responses through variance in reward model scores. We evaluate ten open-weight LLMs across three psychological instruments measuring political attitudes. Our results show cross-metric correlations in both response variability and internal consistency, indicating consistent behavior across output formats. However, no model achieves acceptable psychometric thresholds. Thus, the findings challenge the validity of LLMs as reliable human proxies while establishing construct validity measurements for sequential LLM interactions.
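The abstract sketches the two metrics only at a high level; the computations below are a minimal illustrative sketch of the ideas as stated, not the authors' actual implementation. All function names and the exact formulas (population variance for MTRC-style stability; total-variation distance between Likert probability distributions for MTDT-style branch comparison) are assumptions for illustration:

```python
import statistics

def mtrc_variance(reward_scores):
    """Hypothetical MTRC-style stability: population variance of
    reward-model scores assigned to a model's free-text responses
    across turns. Lower variance suggests more stable alignment.
    (Sketch only; the paper's exact formulation may differ.)"""
    return statistics.pvariance(reward_scores)

def likert_branch_distance(p, q):
    """Hypothetical MTDT-style comparison: total-variation distance
    between two probability distributions over Likert options,
    e.g. from two branches of a decision path. 0 = identical
    distributions, 1 = fully disjoint."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Toy example: four turns of reward scores, one of which diverges.
scores = [0.8, 0.8, 0.8, 0.4]
print(f"MTRC-style variance: {mtrc_variance(scores):.3f}")

# Toy example: 5-point Likert distributions on two branches.
p = [0.1, 0.2, 0.4, 0.2, 0.1]
q = [0.1, 0.2, 0.3, 0.3, 0.1]
print(f"MTDT-style branch distance: {likert_branch_distance(p, q):.3f}")
```

A lower variance and a smaller branch distance would both indicate the cross-format consistency the abstract refers to; the paper presumably aggregates such quantities over scenarios and models before comparing them against psychometric thresholds.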
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: calibration/uncertainty, human-subject application-grounded evaluations, robustness
Contribution Types: Model analysis & interpretability, Position papers
Languages Studied: English
Submission Number: 1153