When Rubrics Backfire: Systematic Preference Drift in LLM Judges

Published: 02 Mar 2026, Last Modified: 06 Mar 2026 · ICLR 2026 Workshop ICBINB · CC BY 4.0
Keywords: LLM-as-a-Judge, Rubric-Induced Bias, LLM Evaluation, Preference Learning
TL;DR: Natural-language evaluation rubrics form an overlooked attack surface: benchmark-compliant rubric edits can systematically bias LLM judges and propagate preference drift into downstream aligned models.
Abstract: Evaluation and alignment pipelines for large language models increasingly rely on LLM-based judges guided by natural-language rubrics. We identify a failure mode in this workflow, which we term Rubric-Induced Preference Drift (RIPD): rubric edits that pass standard validation can nonetheless induce systematic, directional shifts in a judge’s preferences on target domains. We show that such drift can arise from seemingly natural, criterion-preserving rubric refinements and remains difficult to detect with aggregate evaluation metrics. Across multiple datasets and models, these edits preserve benchmark performance while reducing target-domain accuracy by up to 27.9%. When the biased judge is used to generate preference labels for downstream post-training, the induced bias propagates through alignment pipelines and becomes internalized in trained policies, leading to persistent behavioral drift. Our findings demonstrate that evaluation rubrics function as a sensitive control interface rather than a neutral specification, exposing a structural vulnerability in current LLM evaluation and alignment practices.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 84