BiomedBench Suite: Benchmarks for Evaluating LLM Performance on Biomedical Reasoning Tasks

Published: 30 May 2026, Last Modified: 30 May 2026 | ICML2026-AI4Science Poster | CC BY 4.0
Track: Track 1: Original Research/Position/Education/Attention Track
Keywords: Benchmark, LLM, Scientific reasoning, Biomedical data, Text-to-SQL
TL;DR: We present BiomedBench Suite, a pair of benchmarks built on a shared set of biomedical research questions and evaluated through two complementary modalities, surfacing consistent failure modes in state-of-the-art LLMs.
Abstract: Biomedical research demands both fluent reasoning over complex domain knowledge and precise interfacing with structured scientific databases. We present $\textit{BiomedBench Suite}$, a pair of benchmarks built on a shared set of biomedical research questions and evaluated through two complementary modalities, surfacing consistent failure modes in state-of-the-art LLMs. $\textit{CARDBiomedBench}$ evaluates open-ended QA through 68,000 expert-curated question-answer pairs in neurodegenerative disease research, scored for both accuracy and appropriate abstention. $\textit{BiomedSQL}$ extends the same question set to a harmonized biomedical knowledge base, evaluating text-to-SQL generation that requires implicit domain reasoning about significance thresholds, effect directionality, and trial-phase filtering. On CARDBiomedBench, no model effectively balances accuracy with safe abstention. The highest-accuracy model reaches a response quality rate (RQR) of just 51\%, while the safest model reaches a 75\% safety rate at only 24\% RQR. On BiomedSQL, the strongest model reaches only 62.6\% execution accuracy against a 90\% expert baseline. We synthesize lessons from both efforts on benchmark construction, evaluation design, and the persistent gap between syntactic competence and scientific reasoning.
Submission Number: 232