ConceptPsy: A Benchmark Suite with Conceptual Comprehensiveness in Psychology

ACL ARR 2024 April Submission897 Authors

16 Apr 2024 (modified: 02 May 2024) · ACL ARR 2024 April Submission · CC BY 4.0
Abstract: In psychology, assessing the knowledge understanding and reasoning abilities of Large Language Models (LLMs) remains a substantial challenge. A key issue is the "concept bias" present in popular Chinese Massive Multitask Language Understanding (MMLU) benchmarks. This bias stems from the fact that the collected questions cover only a small set of the necessary concepts. Previous Chinese MMLU benchmarks either lack the psychology discipline entirely or include only a small subset of the required concepts. This low concept coverage can result in potentially misleading accuracy, because performance varies substantially across concepts, which could further misdirect model development and refinement. To address this issue, we introduce ConceptPsy, a benchmark that comprehensively covers all college-level required concepts. In addition, we assign a chapter-level concept tag to each question, enabling a more fine-grained evaluation. Our results indicate that although some models achieve high average accuracy, they fail on specific concepts. In conclusion, ConceptPsy serves as a valuable addition to current MMLU benchmarks. We hope it can help developers understand their models' abilities at a concept-by-concept level and guide further model development.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: psychology, evaluation
Contribution Types: Data resources, Data analysis
Languages Studied: Chinese
Submission Number: 897