When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering

Doeun Lee; Muge Zhang; Yi Yu; Ashish Manne; Stephen Koesters; Frank Wen; Brady Buchanan; Lynda Villagomez; Oluwatoba Moninuola; James Lim; Kathryn Tobin; Andrew Srisuwananukorn; Ping Zhang; Sachin Kumar

Published: 23 May 2026, Last Modified: 13 Jun 2026

Venue: SD4H ICML 2026 Poster

License: CC BY 4.0

Keywords: medical LLMs, retrieval augmentation, benchmarking

TL;DR: We present OGCaReBench, a physician-validated clinical QA benchmark built from rare cases. RAG significantly improves performance, with gains depending on retrieval quality, context processing, and reasoning.

Abstract: Across medical specialties, clinical practice is anchored in evidence-based guidelines that codify the best-studied diagnostic and treatment pathways. These pathways routinely fall short for the long tail of real-world care that guidelines do not cover. Most medical large language models (LLMs), however, are trained to encode common, guideline-focused medical knowledge in their parameters. To address this gap, we introduce OGCaReBench, a free-form, retrieval-focused benchmark for evaluating how well LLMs answer clinical questions that require going beyond typical guidelines. Extracted from published medical case reports and validated by medical experts, OGCaReBench contains long-form clinical questions requiring free-text answers, providing a systematic framework for assessing open-ended medical reasoning in rare, case-based scenarios. This work establishes a foundation for benchmarking and advancing both general-purpose and medical LLMs so that they produce reliable answers in challenging clinical contexts.

Submission Number: 112
