Abstract: The effective incorporation of Large Language Models (LLMs) into the field of psychology requires a comprehensive domain benchmark to guide their development and adaptation. Existing Chinese MMLU-style benchmarks, such as CMMLU, do include psychology subjects, but their concept coverage is far from exhaustive: each subject contains only a few hundred questions, and uneven question sampling introduces a ``concept bias'' issue. This bias, which stems from representing a subject with a question set that covers only a small fraction of its concepts, can skew evaluation results. To address this, we present ConceptPsy, a Chinese conceptual benchmark specifically designed to evaluate LLMs' complex reasoning and knowledge in psychology. ConceptPsy covers 12 core subjects and 1,383 concepts drawn from official exams. To avoid copyright issues, we prompt \texttt{GPT-4} to generate questions for each concept, and psychology professionals then validate the questions to ensure high quality. Beyond overall scores, we annotate each question with a chapter label to provide fine-grained results. We evaluate a range of LLMs on ConceptPsy, and the results show significant performance differences across psychology concepts, even among models from the same series. We anticipate that the comprehensive concept coverage and the fine-grained strengths and weaknesses identified by ConceptPsy will facilitate the development of the Chinese psychology domain.
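As a rough illustration of the per-concept generation step described in the abstract, the sketch below prompts GPT-4 once for each concept and collects the drafts for later expert review. The prompt wording, the sample concept labels, and the helper name draft_question are assumptions made for illustration; they do not reproduce the paper's actual pipeline.

# Minimal sketch, assuming the official OpenAI Python client (openai >= 1.0)
# and an OPENAI_API_KEY in the environment; concepts and prompt are hypothetical.
from openai import OpenAI

client = OpenAI()

concepts = ["经典条件反射", "工作记忆", "依恋类型"]  # hypothetical concept labels

def draft_question(concept: str) -> str:
    """Ask GPT-4 for one multiple-choice question targeting a single concept."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You write Chinese psychology exam questions."},
            {"role": "user", "content": (
                f"围绕概念“{concept}”编写一道单项选择题，"
                "包含四个选项（A-D），并在最后给出正确答案。"
            )},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content

# Draft questions still require validation by psychology professionals.
drafts = {c: draft_question(c) for c in concepts}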
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation, psychology
Contribution Types: Data resources
Languages Studied: Chinese
Submission Number: 5317