Reliability in AI-Assisted Critical Care: Assessing Large Language Model Robustness and Instruction Following for Cardiac Arrest Identification
Keywords: clinical decision support, in-hospital cardiac arrest, medical AI, healthcare informatics, comprehensive evaluation
TL;DR: Comprehensive evaluation of 52 LLMs (51 open-source plus GPT-4o) for in-hospital cardiac arrest detection shows some open-source models rival GPT-4o, highlighting trade-offs between accuracy, robustness, and instruction-following in clinical AI.
Abstract: This study systematically evaluates the performance, robustness, and instruction-following capabilities of large language models (LLMs) in identifying in-hospital cardiac arrest (IHCA) events. We assessed 51 open-source LLMs (36 general-purpose and 15 medical-specific models) against GPT-4o as a benchmark, with robustness quantified via confidence intervals derived from non-parametric bootstrapping across multiple runs. While GPT-4o set a high standard with consistent performance across metrics, several open-source models achieved competitive results (e.g., Mistral-Nemo-Instruct-2407: F1: 0.84±0.05, Balanced Accuracy: 0.84±0.04), albeit with greater variability. Medical-specific models showed strong recall but often exhibited wider confidence intervals, indicating potential challenges in maintaining consistent performance. Instruction-following evaluation revealed that some general-purpose models excelled at adhering to clinical guidelines (e.g., unsloth/Meta-Llama-3.1-8B-Instruct: 99.0%), while certain medical models struggled with consistency. Our findings underscore the potential of LLMs in critical care settings while highlighting the need to balance accuracy, robustness, and instruction-following for reliable clinical deployment.
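To make the robustness measure concrete, the sketch below computes a non-parametric bootstrap confidence interval for F1 by resampling cases with replacement. This is a minimal illustration rather than the authors' actual pipeline: the arrays `y_true` and `y_pred`, the 1,000 resamples, and the 95% interval are all assumptions for demonstration.

```python
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_f1_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Non-parametric bootstrap CI for F1: resample cases with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample n indices with replacement
        scores.append(f1_score(y_true[idx], y_pred[idx]))
    lo, hi = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(np.mean(scores)), (float(lo), float(hi))

# Hypothetical example: binary IHCA labels and one model's predictions.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])
mean_f1, (ci_lo, ci_hi) = bootstrap_f1_ci(y_true, y_pred)
print(f"F1 = {mean_f1:.2f} (95% CI: {ci_lo:.2f}-{ci_hi:.2f})")
```

Percentile intervals like this make no distributional assumptions, which is why wider intervals for the medical-specific models can be read directly as less consistent performance across resamples.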
Supplementary Material: zip
Submission Number: 42