When Rubrics Backfire: Systematic Preference Drift in LLM Judges

Published: 02 Mar 2026, Last Modified: 06 Mar 2026 · ICLR 2026 Workshop ICBINB · CC BY 4.0
Keywords: LLM-as-a-Judge, Rubric-Induced Bias, LLM Evaluation, Preference Learning
TL;DR: Natural-language evaluation rubrics form an overlooked attack surface: benchmark-compliant rubric edits can systematically bias LLM judges and propagate preference drift into downstream aligned models.
Abstract: Evaluation and alignment pipelines for large language models increasingly rely on LLM-based judges guided by natural-language rubrics. We identify a failure mode in this workflow, which we term Rubric-Induced Preference Drift (RIPD): rubric edits that pass standard validation can nonetheless induce systematic, directional shifts in a judge’s preferences on target domains. We show that such drift can arise from seemingly natural, criterion-preserving rubric refinements and remains difficult to detect with aggregate evaluation metrics. Across multiple datasets and models, these edits preserve benchmark performance while reducing target-domain accuracy by up to 27.9%. When the biased judge is used to generate preference labels for downstream post-training, the induced bias propagates through alignment pipelines and becomes internalized in trained policies, leading to persistent behavioral drift. Our findings demonstrate that evaluation rubrics function as a sensitive control interface rather than a neutral specification, exposing a structural vulnerability in current LLM evaluation and alignment practices.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 84