ConceptPsy: A Benchmark Suite with Conceptual Comprehensiveness in Psychology

ACL ARR 2024 April Submission897 Authors

16 Apr 2024 (modified: 02 May 2024) · ACL ARR 2024 April Submission · CC BY 4.0
Abstract: In psychology, assessing the knowledge understanding and reasoning abilities of Large Language Models (LLMs) remains a substantial challenge. A key issue is the "concept bias" present in popular Chinese Massive Multitask Language Understanding (MMLU) benchmarks. This bias stems from the fact that the collected questions cover only a small set of the necessary concepts. Previous Chinese MMLU benchmarks either lack the psychology discipline entirely or include only a small subset of the required concepts. This low concept coverage can result in potentially misleading accuracy, because performance varies substantially across concepts, which could further misdirect model development and refinement. To address this issue, we introduce ConceptPsy, a benchmark that comprehensively covers all college-level required concepts. In addition, we assign a chapter-level concept tag to each question, enabling a more fine-grained evaluation. Our results indicate that although some models achieve high average accuracy, they fail on specific concepts. In conclusion, ConceptPsy serves as a valuable addition to current MMLU benchmarks. We hope it can help developers understand their models' abilities at a concept-by-concept level and guide further model development.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: psychology, evaluation
Contribution Types: Data resources, Data analysis
Languages Studied: Chinese
Submission Number: 897