Transitioning from Full-Context to Active Evidence-Seeking Evaluation: A Novel Benchmark for Real-World Artificial Intelligence-Assisted Medical Diagnosis

16 Sept 2025 (modified: 17 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · License: CC BY 4.0
Keywords: Large Language Models; Active Evidence-Seeking; Evaluation Benchmarks; Diagnostic Reasoning
Abstract: Large language models (LLMs) achieve strong results on medical benchmarks, yet prevailing evaluations rely on a passive, full-context paradigm in which complete information is provided upfront. This fails to reflect clinical practice, where information is scarce, cues are ambiguous, and clinicians must proactively elicit and verify evidence. Such static designs bypass the most critical stage, active evidence-seeking, and systematically overestimate model capability. We introduce ROUNDS-Bench, which decouples information-acquisition strategy from diagnostic reasoning and uses a standardized patient simulator to reconstruct multi-turn, active evidence-seeking diagnostic processes, including history-taking, physical examination, and test ordering. The benchmark comprises two tasks: Task 1 (Full-Context) provides complete cases to estimate performance upper bounds; Task 2 (Active Evidence-Seeking) reveals only demographics and the chief complaint, requiring models to drive multi-turn questioning and test selection, stop evidence-gathering at appropriate points, and deliver a diagnosis. Evaluations of state-of-the-art LLMs, including GPT-4o, Qwen, DeepSeek, and Llama, show substantial degradation from Task 1 to Task 2, exposing a capability gap between passive evaluation and real clinical decision-making and highlighting the need for better active evidence-seeking and decision integration. ROUNDS-Bench aims to shift medical AI from passive answering toward proactive agents that inquire, investigate, halt in a timely manner, and diagnose accurately, advancing reliable, efficient, and safe clinical decision support. We will release code and simulator interfaces for reproducibility.
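Since the released simulator interface is not shown on this page, below is a minimal Python sketch of the Task 2 interaction loop the abstract describes. The `SimulatedPatient` class, the `next_action` policy interface, `ScriptedPolicy`, and the turn budget are illustrative assumptions for this sketch, not ROUNDS-Bench's actual API.

```python
"""Sketch of a Task 2 (Active Evidence-Seeking) episode: the model sees only
demographics and the chief complaint, must actively query a patient simulator,
decide when to stop gathering evidence, and commit to a diagnosis.
All names and interfaces here are hypothetical."""

from dataclasses import dataclass


@dataclass
class SimulatedPatient:
    """Hypothetical standardized patient simulator: reveals demographics and
    chief complaint upfront, answers queries from a hidden case record."""
    demographics: str
    chief_complaint: str
    hidden_findings: dict  # e.g. {"ECG": "ST elevation in leads II, III, aVF"}

    def initial_presentation(self) -> str:
        return f"{self.demographics}. Chief complaint: {self.chief_complaint}."

    def respond(self, query: str) -> str:
        # History questions, exam maneuvers, and test orders all route through
        # this single lookup in the sketch.
        return self.hidden_findings.get(query, "Not available / not performed.")


class ScriptedPolicy:
    """Toy stand-in for an LLM agent: asks a fixed list of queries, then halts.
    A real agent would instead return ("ask", query) or ("diagnose", dx) based
    on the transcript so far."""

    def __init__(self, queries, diagnosis):
        self.queries, self.diagnosis = list(queries), diagnosis

    def next_action(self, transcript):
        if self.queries:
            return ("ask", self.queries.pop(0))
        return ("diagnose", self.diagnosis)


def run_active_episode(model, patient: SimulatedPatient, max_turns: int = 10) -> str:
    """Drive the multi-turn loop: question, order tests, stop, diagnose."""
    transcript = [patient.initial_presentation()]
    for _ in range(max_turns):
        action, payload = model.next_action(transcript)
        if action == "diagnose":  # model halts evidence-gathering itself
            return payload
        transcript.append(f"Q: {payload}")
        transcript.append(f"A: {patient.respond(payload)}")
    # Turn budget exhausted: force a final diagnosis.
    return model.next_action(transcript + ["FINAL: give your diagnosis"])[1]


if __name__ == "__main__":
    patient = SimulatedPatient(
        demographics="54-year-old male",
        chief_complaint="chest pain for 2 hours",
        hidden_findings={"ECG": "ST elevation in leads II, III, aVF"},
    )
    model = ScriptedPolicy(queries=["ECG"], diagnosis="inferior STEMI")
    print(run_active_episode(model, patient))  # -> "inferior STEMI"
```

Routing history-taking, physical exam, and test ordering through one query interface mirrors the abstract's framing, in which the evaluated model, not the benchmark, decides which evidence to seek and when to stop.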
Primary Area: datasets and benchmarks
Submission Number: 7087