I Can’t Believe It’s Not Safer: Preference–Safety Disassociation in Clinical LLM Evaluation

Published: 02 Mar 2026, Last Modified: 09 Mar 2026 · ICLR 2026 Workshop ICBINB · CC BY 4.0
Keywords: Large Language Models (LLM); clinician evaluation; safety; preference-based ranking
TL;DR: Using clinician rubric ratings and 18k blinded preference votes from MOOVE, we show that preference-based model rankings can be misaligned with clinical safety and can hide specialty-specific “no-go zones” with high failure rates.
Abstract: We examine how clinicians evaluate large language models for medical use, drawing on expert feedback collected through MOOVE (Massive Open Online Validation and Evaluation), an evaluation platform. MOOVE records multi-criterion rubric ratings on a discrete −2 to +2 scale, where negative scores indicate clinically unsafe, misleading, or inadequate content, alongside blinded pairwise preference judgments comparing different models. Using 18,685 pairwise preference judgments between outputs from 13 clinical language models, provided by 736 clinicians across more than 28 countries, we identify a dissociation between clinician preferences and safety assessments. Models that are frequently preferred or perform well on aggregate metrics can still exhibit substantial rates of clinically meaningful failures (ratings ≤ −1) on key criteria such as harmfulness and accuracy. These failures vary across medical specialties, creating domain-specific areas of elevated risk that global summaries obscure. Because of this preference–safety dissociation, a preference leaderboard should not be treated as a proxy for safety without an explicit alignment audit. Selecting models based on which is "overall better" can mask safety-critical risks that only become apparent when failure rates and specialty-stratified performance are reported explicitly. These findings highlight a limitation of preference-based evaluation in clinical settings and support evaluation practices that distinguish preference from safety when assessing medical language models.
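The two metrics the abstract contrasts can be sketched as follows. This is a minimal illustration, not the authors' code: the votes, ratings, model names, and specialty labels below are made-up toy data, and the failure-rate definition (fraction of rubric ratings ≤ −1) follows the threshold stated in the abstract.

```python
# Toy contrast between a preference leaderboard and specialty-stratified
# failure rates. All data here are hypothetical.
from collections import defaultdict

# Blinded pairwise preference votes as (winner, loser) pairs -- hypothetical.
votes = [("A", "B"), ("A", "B"), ("A", "C"), ("B", "C"), ("A", "C"), ("C", "B")]

# Rubric ratings as (model, specialty, score on the -2..+2 scale) -- hypothetical.
ratings = [
    ("A", "cardiology", 2), ("A", "cardiology", 1),
    ("A", "oncology", -2), ("A", "oncology", -1),  # model A's "no-go zone"
    ("B", "cardiology", 1), ("B", "oncology", 1),
    ("C", "cardiology", 0), ("C", "oncology", 0),
]

def win_rate(votes):
    """Fraction of comparisons won, per model."""
    wins, total = defaultdict(int), defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        total[winner] += 1
        total[loser] += 1
    return {m: wins[m] / total[m] for m in total}

def failure_rate(ratings, threshold=-1):
    """Fraction of ratings <= threshold, per (model, specialty) pair."""
    fails, total = defaultdict(int), defaultdict(int)
    for model, specialty, score in ratings:
        key = (model, specialty)
        total[key] += 1
        if score <= threshold:
            fails[key] += 1
    return {k: fails[k] / total[k] for k in total}

wr = win_rate(votes)
fr = failure_rate(ratings)
# Model A tops the toy preference leaderboard...
assert max(wr, key=wr.get) == "A"
# ...yet fails every oncology rating in this toy data,
# a risk a global summary would hide.
assert fr[("A", "oncology")] == 1.0
```

The point of the sketch is structural: the leaderboard aggregates over specialties, while the failure rate is keyed by (model, specialty), so a model can dominate the former while having a 100% failure pocket in the latter.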
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 96