Keywords: medical LLMs, retrieval augmentation, benchmarking
TL;DR: We present OGCaReBench, a physician-validated clinical QA benchmark built from rare cases. Retrieval-augmented generation (RAG) significantly improves performance, with gains depending on retrieval quality, context processing, and reasoning.
Abstract: Across medical specialties, clinical practice is anchored in evidence-based guidelines that codify the best-studied diagnostic and treatment pathways. These pathways routinely fall short for the long tail of real-world care not covered by guidelines. Most medical large language models (LLMs), however, are trained to encode common, guideline-focused medical knowledge in their parameters. To address this gap, we introduce OGCaReBench, a free-form, retrieval-focused benchmark for evaluating how well LLMs answer clinical questions that require going beyond typical guidelines. Extracted from published medical case reports and validated by medical experts, OGCaReBench contains long-form clinical questions requiring free-text answers, providing a systematic framework for assessing open-ended medical reasoning in rare, case-based scenarios. This work establishes a foundation for benchmarking and advancing both general-purpose and medical LLMs to produce reliable answers in challenging clinical contexts.
Submission Number: 112