Abstract: Rare diseases, including Inborn Errors of Metabolism (IEM), pose significant diagnostic challenges. Case reports serve as key but computationally underutilized resources to inform diagnosis. Clinical dense information extraction refers to organizing medical information into structured, predefined categories. Large Language Models (LLMs) may enable scalable dense information extraction from case reports but are rarely evaluated for this task. We introduce CaseReportBench, an expert-crafted dataset for dense information extraction from case reports, focusing on IEMs. Using this dataset, we assess various models and prompting strategies, introducing the novel strategies of category-specific prompting and subheading-filtered data integration. Zero-shot chain-of-thought prompting offers little advantage over standard zero-shot prompting, while category-specific prompting improves alignment with the benchmark. The open-source Qwen2.5:7B outperforms GPT-4o on this task. Our clinician evaluations show that LLMs can extract clinically relevant details from case reports, supporting rare disease diagnosis and management, while also highlighting areas for improvement, such as LLMs' limited ability to recognize negative findings relevant to differential diagnosis. This work advances LLM-driven clinical NLP, paving the way for scalable, privacy-conscious medical AI applications.