I Can’t Believe It’s Not Safer: Preference–Safety Disassociation in Clinical LLM Evaluation

Published: 02 Mar 2026, Last Modified: 09 Mar 2026 · ICLR 2026 Workshop ICBINB · CC BY 4.0
Keywords: Large Language Models (LLM); clinician evaluation; safety; preference-based ranking
TL;DR: Using clinician rubric ratings and 18k blinded preference votes from MOOVE, we show that preference-based model rankings can be misaligned with clinical safety and can hide specialty-specific “no-go zones” with high failure rates.
Abstract: We examine how clinicians evaluate large language models for medical use, drawing on expert feedback collected through MOOVE (Massive Open Online Validation and Evaluation), an evaluation platform. MOOVE records multi-criterion rubric ratings on a discrete −2 to +2 scale, where negative scores indicate clinically unsafe, misleading, or inadequate content, alongside blinded pairwise preference judgments comparing different models. Using 18,685 pairwise preference judgments between outputs from 13 clinical language models, provided by 736 clinicians across more than 28 countries, we identify a dissociation between clinician preferences and safety assessments. Models that are frequently preferred or perform well on aggregate metrics can still exhibit substantial rates of clinically meaningful failures (ratings ≤ −1) on key criteria such as harmfulness and accuracy. These failures vary across medical specialties, creating domain-specific areas of elevated risk that global summaries obscure. Because of this preference–safety dissociation, a preference leaderboard should not be treated as a proxy for safety without an explicit alignment audit. Selecting models based on which is "overall better" can mask safety-critical risks that only become apparent when failure rates and specialty-stratified performance are reported explicitly. These findings highlight a limitation of preference-based evaluation in clinical settings and support evaluation practices that distinguish preference from safety when assessing medical language models.
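The two metrics the abstract contrasts can be sketched as follows. This is a minimal illustration, not the authors' code: the votes, ratings, model names, and specialty labels below are made-up toy data, and the failure-rate definition (fraction of rubric ratings ≤ −1) follows the threshold stated in the abstract.

```python
# Toy contrast between a preference leaderboard and specialty-stratified
# failure rates. All data here are hypothetical.
from collections import defaultdict

# Blinded pairwise preference votes as (winner, loser) pairs -- hypothetical.
votes = [("A", "B"), ("A", "B"), ("A", "C"), ("B", "C"), ("A", "C"), ("C", "B")]

# Rubric ratings as (model, specialty, score on the -2..+2 scale) -- hypothetical.
ratings = [
    ("A", "cardiology", 2), ("A", "cardiology", 1),
    ("A", "oncology", -2), ("A", "oncology", -1),  # model A's "no-go zone"
    ("B", "cardiology", 1), ("B", "oncology", 1),
    ("C", "cardiology", 0), ("C", "oncology", 0),
]

def win_rate(votes):
    """Fraction of comparisons won, per model."""
    wins, total = defaultdict(int), defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        total[winner] += 1
        total[loser] += 1
    return {m: wins[m] / total[m] for m in total}

def failure_rate(ratings, threshold=-1):
    """Fraction of ratings <= threshold, per (model, specialty) pair."""
    fails, total = defaultdict(int), defaultdict(int)
    for model, specialty, score in ratings:
        key = (model, specialty)
        total[key] += 1
        if score <= threshold:
            fails[key] += 1
    return {k: fails[k] / total[k] for k in total}

wr = win_rate(votes)
fr = failure_rate(ratings)
# Model A tops the toy preference leaderboard...
assert max(wr, key=wr.get) == "A"
# ...yet fails every oncology rating in this toy data,
# a risk a global summary would hide.
assert fr[("A", "oncology")] == 1.0
```

The point of the sketch is structural: the leaderboard aggregates over specialties, while the failure rate is keyed by (model, specialty), so a model can dominate the former while having a 100% failure pocket in the latter.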
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 96