MEDIC: Comprehensive Evaluation of Leading Indicators for LLM Safety and Utility in Clinical Applications
Abstract: While Large Language Models (LLMs) achieve superhuman performance on standardized medical licensing exams, these static benchmarks have become saturated and are increasingly disconnected from the functional requirements of clinical workflows. To bridge the gap between theoretical capability and verified utility, we introduce MEDIC, a comprehensive evaluation framework that establishes leading indicators of clinical LLM competence across five dimensions. These leading indicators expose cross-benchmark capability gaps, such as the divergence between static knowledge retrieval and functional execution, and can inform model selection before costly deployment-based evaluation. Beyond standard question answering, we assess operational capabilities using deterministic execution protocols and a novel Cross-Examination Framework (CEF), which quantifies information fidelity and hallucination rates without relying on reference texts. Our evaluation across a heterogeneous task suite exposes critical performance trade-offs: we identify a significant knowledge-execution gap, where proficiency in static retrieval does not predict success in operational tasks such as clinical calculation or SQL generation. Furthermore, we observe a divergence between passive safety (refusal) and active safety (error detection), revealing that models fine-tuned for high refusal rates often fail to reliably audit clinical documentation for factual accuracy. These findings demonstrate that no single architecture dominates across all dimensions, underscoring the need for a portfolio approach to clinical model deployment.
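The abstract characterizes CEF only at a high level (reference-free measurement of information fidelity and hallucination). As a rough illustration, and not the paper's actual protocol, the sketch below shows one way a reference-free cross-examination loop could be wired: probe questions are derived from the model's own output, answered separately against the source note and the output, and checked for agreement. The function names, prompts, and scoring rule are hypothetical, and the model interface is assumed to be a generic text-in/text-out callable.

```python
# Illustrative sketch of a cross-examination style consistency check.
# NOTE: function names, prompts, and scoring are hypothetical; they are not
# the paper's exact CEF protocol, only the general reference-free idea.
from typing import Callable, List

LLM = Callable[[str], str]  # any text-in / text-out model interface


def cross_examine(source_note: str, generated_summary: str,
                  examiner: LLM, witness: LLM) -> float:
    """Return an agreement score in [0, 1]; lower values suggest hallucination."""
    # 1) The examiner derives short factual probe questions from the summary.
    question_prompt = (
        "List 5 short factual questions, one per line, that are answered "
        f"by the following text:\n{generated_summary}"
    )
    questions: List[str] = [
        q for q in examiner(question_prompt).splitlines() if q.strip()
    ]

    agreements = 0
    for q in questions:
        # 2) The witness answers each question twice: grounded in the source
        #    note, and grounded only in the generated summary.
        answer_from_source = witness(
            f"Using only this text:\n{source_note}\n\nAnswer briefly: {q}"
        )
        answer_from_summary = witness(
            f"Using only this text:\n{generated_summary}\n\nAnswer briefly: {q}"
        )
        # 3) The examiner judges whether the two answers state the same fact.
        verdict = examiner(
            "Do these two answers state the same fact? Reply YES or NO.\n"
            f"A: {answer_from_source}\nB: {answer_from_summary}"
        )
        agreements += int("YES" in verdict.upper())

    return agreements / max(len(questions), 1)
```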
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: The revised manuscript incorporates methodological and structural updates to address reviewer feedback. Key revisions include the addition of a clinician validation analysis and the integration of CEF human-validation results into the main text to substantiate judge alignment. We refined the terminology regarding leading indicators to clarify their scope, renamed the baseline category to General Reasoning Controls, and expanded the comparative discussion of related benchmarks. Additionally, the text now explicitly clarifies our hybrid evaluation strategy and acknowledges prompt sensitivity as a formal limitation. All changes are highlighted in blue text.
Assigned Action Editor: ~Venkatesh_Babu_Radhakrishnan2
Submission Number: 6818