Evaluating Large Language Models with Psychometrics

ACL ARR 2025 May Submission3255 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Large Language Models (LLMs) have demonstrated exceptional capabilities in solving various tasks, progressively evolving into general-purpose assistants. The increasing integration of LLMs into society has sparked interest in their "behavioral patterns" and in whether these patterns remain consistent across different contexts, questions that could deepen our understanding of LLMs' limitations and uses. This paper proposes evaluating LLMs with psychometrics, employing psychological constructs as examples to demonstrate how psychometrics can uncover LLMs' behavioral patterns and enhance evaluation reliability. Our framework encompasses psychological dimension identification, assessment dataset design, and assessment with results validation. We identify five key psychological constructs (personality, values, emotional intelligence, theory of mind, and self-efficacy), assessed through a suite of 13 datasets featuring diverse scenarios and item types. We reveal complexities in LLMs' behaviors and uncover significant discrepancies between LLMs' self-reported traits and their response patterns in real-world scenarios. Our findings also show that some preference-based tests, originally designed for humans, could not elicit reliable responses from LLMs. This paper offers a thorough psychometric assessment of LLMs, providing insights into reliable evaluation and potential applications in AI and the social sciences.
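To make the self-report side of such an assessment concrete, below is a minimal sketch of administering a Likert-scale inventory to an LLM and scoring one trait dimension. This is not the authors' implementation: the items, the `query_llm` stub, the 1-5 scale wording, and the reverse-keying scheme are illustrative assumptions only, loosely modeled on standard personality inventories.

```python
# Sketch: administer hypothetical Likert-scale items to an LLM and compute a
# mean trait score. Items, prompts, and the query_llm stub are assumptions,
# not the paper's actual datasets or code.

import re

# Hypothetical items for a single construct (e.g., extraversion); a validated
# inventory would have many more items, including reverse-keyed ones.
ITEMS = [
    ("I see myself as someone who is talkative.", False),
    ("I see myself as someone who is reserved.", True),  # reverse-keyed
]

SCALE = ("1 = disagree strongly, 2 = disagree a little, 3 = neutral, "
         "4 = agree a little, 5 = agree strongly")


def query_llm(prompt: str) -> str:
    """Stub for a chat-completion call; replace with a real API client."""
    return "3"  # placeholder response


def score_item(response: str, reverse: bool) -> float | None:
    """Extract a 1-5 rating from the model's reply; None if unparseable."""
    match = re.search(r"[1-5]", response)
    if match is None:
        return None
    rating = int(match.group())
    return 6 - rating if reverse else rating


def administer(items) -> float | None:
    """Present each item, score parseable replies, and return the mean."""
    scores = []
    for text, reverse in items:
        prompt = (f"Rate the statement on a 1-5 scale ({SCALE}). "
                  f"Reply with the number only.\nStatement: {text}")
        score = score_item(query_llm(prompt), reverse)
        if score is not None:
            scores.append(score)
    return sum(scores) / len(scores) if scores else None


if __name__ == "__main__":
    print("Mean trait score:", administer(ITEMS))
```

Comparing such self-reported scores against the model's behavior on scenario-based items is one way to surface the self-report/behavior discrepancies the abstract describes; the paper's actual datasets and validation procedures are more extensive.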
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Computational Social Science and Cultural Analytics, Resources and Evaluation
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 3255