MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-NC-SA 4.0
TL;DR: We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning.
Abstract: We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. It includes two subsets, Text for text evaluation and MM for multimodal evaluation. Notably, MM introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks with simple QA pairs generated from image captions. MedXpertQA applies rigorous filtering and augmentation to address the insufficient difficulty of existing benchmarks like MedQA, and incorporates specialty board questions to improve clinical relevance and comprehensiveness. We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert reviews to ensure accuracy and reliability. We evaluate 18 leading models on MedXpertQA. Moreover, medicine is deeply connected to real-world decision-making, providing a rich and representative setting for assessing reasoning abilities beyond mathematics and code. To this end, we develop a reasoning-oriented subset to facilitate the assessment of o1-like models.
Lay Summary: We began our research to address a critical gap in evaluating AI systems designed for medical applications. Existing medical AI benchmarks were either too simple or failed to reflect the complexity of real-world clinical scenarios. Most notably, they lacked questions from a wide range of specialties and did not sufficiently test expert-level reasoning. To address this, we created MedXpertQA, a new benchmark composed of 4,460 challenging medical questions, including both text-based (Text subset) and multimodal (MM subset) questions. These questions were sourced from professional medical exams and clinical cases across 17 specialties and 11 body systems. We carefully filtered and augmented the dataset using both AI and human experts, and ensured that the questions demand complex medical reasoning. We also evaluated 18 state-of-the-art models to understand their performance on these tasks. Our work matters because it provides a challenging and realistic benchmark for developing and testing medical AI systems. MedXpertQA pushes current models beyond simple pattern recognition, encouraging development toward systems capable of trustworthy and expert-level clinical decision support.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/TsinghuaC3I/MedXpertQA
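For context, below is a minimal sketch of how one might load and score the benchmark's multiple-choice questions. The file name (medxpertqa_text.jsonl) and the field names (question, options, label) are assumptions for illustration only; consult the linked repository for the actual data format and evaluation protocol.

```python
import json


def load_questions(path):
    """Load benchmark questions from a JSONL file (one JSON object per line)."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f]


def accuracy(questions, predict):
    """Score predicted option letters against gold labels.

    `predict` maps a question dict to an option letter such as "A";
    the "label" field name is a hypothetical placeholder.
    """
    if not questions:
        return 0.0
    correct = sum(1 for q in questions if predict(q) == q["label"])
    return correct / len(questions)


if __name__ == "__main__":
    # Hypothetical file name; the released subsets may be packaged differently.
    questions = load_questions("medxpertqa_text.jsonl")
    # Trivial baseline: always answer "A".
    print(f"Baseline accuracy: {accuracy(questions, lambda q: 'A'):.3f}")
```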
Primary Area: Applications->Health / Medicine
Keywords: Medicine, Benchmark, Multimodal
Flagged For Ethics Review: true
Submission Number: 6591