Evaluating Large Language Models with Psychometrics

ACL ARR 2025 May Submission3255 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Large Language Models (LLMs) have demonstrated exceptional capabilities in solving various tasks, progressively evolving into general-purpose assistants. The increasing integration of LLMs into society has sparked interest in their "behavioral patterns" and in whether these patterns remain consistent across different contexts, questions that could deepen our understanding of LLMs' limitations and uses. This paper proposes evaluating LLMs with psychometrics, employing psychological constructs as examples to demonstrate how psychometrics can uncover LLMs' behavioral patterns and enhance evaluation reliability. Our framework encompasses psychological dimension identification, assessment dataset design, and assessment with results validation. We identify five key psychological constructs (personality, values, emotional intelligence, theory of mind, and self-efficacy), assessed through a suite of 13 datasets featuring diverse scenarios and item types. We reveal complexities in LLMs' behaviors and uncover significant discrepancies between LLMs' self-reported traits and their response patterns in real-world scenarios. Our findings also show that some preference-based tests, originally designed for humans, could not elicit reliable responses from LLMs. This paper offers a thorough psychometric assessment of LLMs, providing insights into reliable evaluation and potential applications in AI and the social sciences.
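To make the self-report side of such an assessment concrete, below is a minimal sketch of administering a Likert-scale inventory to an LLM and scoring one trait dimension. This is not the authors' implementation: the items, the `query_llm` stub, the 1-5 scale wording, and the reverse-keying scheme are illustrative assumptions only, loosely modeled on standard personality inventories.

```python
# Sketch: administer hypothetical Likert-scale items to an LLM and compute a
# mean trait score. Items, prompts, and the query_llm stub are assumptions,
# not the paper's actual datasets or code.

import re

# Hypothetical items for a single construct (e.g., extraversion); a validated
# inventory would have many more items, including reverse-keyed ones.
ITEMS = [
    ("I see myself as someone who is talkative.", False),
    ("I see myself as someone who is reserved.", True),  # reverse-keyed
]

SCALE = ("1 = disagree strongly, 2 = disagree a little, 3 = neutral, "
         "4 = agree a little, 5 = agree strongly")


def query_llm(prompt: str) -> str:
    """Stub for a chat-completion call; replace with a real API client."""
    return "3"  # placeholder response


def score_item(response: str, reverse: bool) -> float | None:
    """Extract a 1-5 rating from the model's reply; None if unparseable."""
    match = re.search(r"[1-5]", response)
    if match is None:
        return None
    rating = int(match.group())
    return 6 - rating if reverse else rating


def administer(items) -> float | None:
    """Present each item, score parseable replies, and return the mean."""
    scores = []
    for text, reverse in items:
        prompt = (f"Rate the statement on a 1-5 scale ({SCALE}). "
                  f"Reply with the number only.\nStatement: {text}")
        score = score_item(query_llm(prompt), reverse)
        if score is not None:
            scores.append(score)
    return sum(scores) / len(scores) if scores else None


if __name__ == "__main__":
    print("Mean trait score:", administer(ITEMS))
```

Comparing such self-reported scores against the model's behavior on scenario-based items is one way to surface the self-report/behavior discrepancies the abstract describes; the paper's actual datasets and validation procedures are more extensive.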
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Computational Social Science and Cultural Analytics, Resources and Evaluation
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 3255