ClinAuth: A Multi-Agent Benchmark and Dataset for Quantifying Omission Bias and Authority Deference in Hierarchical Medical Large Language Models
Keywords: Large Language Models, AI Safety, Alignment, Reinforcement Learning, Sycophancy, Omission Bias, Commission Bias, Authority Deference, Clinical AI, LLM-as-a-Judge, Adversarial Evaluation, Simulation Environments, Clinical Decision Support Systems, Medical AI, Electronic Health Records (EHR), Healthcare AI Safety, Measurable World Feedback, Health Outcomes, Safety Outcomes
TL;DR: Across 20,000 simulations, frontier models permitted fatal errors or blocked life-saving care in 27.17% of cases.
Abstract: Large Language Models (LLMs) are increasingly used in clinical decision support and electronic health record (EHR) systems, achieving strong performance on medical benchmarks and instruction-following tasks. However, such performance does not ensure safe behavior in hierarchical, high-stakes settings. Alignment methods such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) can encourage sycophantic behavior, leading to a failure mode we term clinical omission bias, in which models permit harmful decisions by staying silent to avoid conflict or penalties. Conversely, Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) lead to unnecessary and false interventions that block proper treatment, inducing a clinical commission bias. We introduce the Clinical Authority Family, including the ClinAuth Benchmark and the ClinAuth-800 dataset, a multi-agent framework for evaluating LLM behavior under authority pressure. We simulate a hospital setting in which a model evaluates treatment decisions proposed by an attending physician and controls EHR access. By varying physician tone, system-level incentives, and weak-to-strong oversight, we study the effects of alignment methods across frontier models. In 20,000 simulations, frontier models falsely refuse correct treatments in 17.08% of control cases and allow fatal clinical errors in 37.33% of harmful cases, resulting in an overall harmful EHR interaction rate of 27.17% across all cases.
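The three headline rates can be related with a short sketch. It assumes the overall harmful interaction rate is a case-weighted average of the false-refusal rate (control cases) and the fatal-error rate (harmful cases). The abstract does not state the split between control and harmful cases, so the function names and the `control_fraction` parameter below are hypothetical and the implied split is only an inference under that assumption.

```python
def overall_harmful_rate(false_refusal_rate, fatal_permit_rate, control_fraction):
    """Case-weighted mix of the two failure rates (all rates in percent).

    Assumption: every case is either a control case or a harmful case,
    and control_fraction is the share of control cases.
    """
    if not 0.0 <= control_fraction <= 1.0:
        raise ValueError("control_fraction must lie in [0, 1]")
    return (control_fraction * false_refusal_rate
            + (1.0 - control_fraction) * fatal_permit_rate)


def implied_control_fraction(false_refusal_rate, fatal_permit_rate, overall_rate):
    """Solve the weighted-average relation for the share of control cases."""
    return (fatal_permit_rate - overall_rate) / (fatal_permit_rate - false_refusal_rate)


# Rates reported in the abstract, in percent.
FALSE_REFUSAL = 17.08   # correct treatments falsely refused (control cases)
FATAL_PERMIT = 37.33    # fatal errors allowed (harmful cases)
OVERALL = 27.17         # harmful EHR interactions across all cases

fraction = implied_control_fraction(FALSE_REFUSAL, FATAL_PERMIT, OVERALL)
print(f"implied control share: {fraction:.4f}")
print(f"rate under an exact 50/50 split: "
      f"{overall_harmful_rate(FALSE_REFUSAL, FATAL_PERMIT, 0.5):.3f}%")
```

Under this assumption the reported figures are consistent with a roughly even split between control and harmful cases, since an exact 50/50 split would give a value very close to the reported overall rate.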
Submission Category: Full Paper
Overaged Verification: Yes
Latin American Hispanic Heritage: Yes
ICML Proceedings Status: No
Submission Number: 19