CMCOQA: A Chinese Medical Complex Open-Question Answering Benchmark

Published: 01 Jan 2024 · Last Modified: 20 May 2025 · BIBM 2024 · CC BY-SA 4.0
Abstract: With the development of Large Language Models (LLMs), many Chinese medical benchmarks have emerged. These benchmarks primarily use multiple-choice questions and open-ended questions as test items. However, our experimental results indicate that multiple-choice questions are a poor probe of LLM capability, and relatively simple open-ended questions do not effectively assess how well LLMs actually grasp medical knowledge. We therefore propose the Chinese Medical Complex Open-Question Answering Benchmark (CMCOQA), designed to evaluate the true medical proficiency of LLMs more accurately and efficiently by constructing complex open-ended questions set in medical scenarios. The benchmark covers three evaluation dimensions: Completeness, Depth, and Professionalism. Starting from 100 manually written complex questions as seeds, we expand the set to 1,200 questions using the Self-Instruct method with GPT-4o. GPT-4o then self-checks the generated questions, followed by manual screening to ensure broad coverage and sufficient depth. Both human annotators and GPT-4o score model answers along these three dimensions, and we additionally report automated metrics. We compute correlations between these metrics and the human scores to validate the results. Through this work, CMCOQA can further promote the development of Chinese medical LLMs in terms of medical professionalism.
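As a concrete illustration of the validation step described above, the minimal sketch below computes Spearman and Pearson correlations between automated-metric scores and human ratings. The variable names and sample values are hypothetical placeholders, not data or code from the paper.

```python
# Minimal sketch: correlating automated metric scores with human ratings.
# The score arrays below are hypothetical placeholders, not CMCOQA data.
from scipy.stats import spearmanr, pearsonr

# One entry per evaluated answer: an automated metric (e.g., a 0-1 score)
# and the corresponding human rating on a 1-5 scale.
metric_scores = [0.62, 0.48, 0.91, 0.33, 0.75, 0.58]
human_scores = [4, 3, 5, 2, 4, 3]

rho, rho_p = spearmanr(metric_scores, human_scores)   # rank correlation
r, r_p = pearsonr(metric_scores, human_scores)        # linear correlation

print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3f})")
print(f"Pearson  r   = {r:.3f} (p = {r_p:.3f})")
```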