Keywords: LLM-as-a-judge, Stylistic Bias, Model Evaluation
Abstract: Large Language Models (LLMs) are increasingly employed as automated judges for evaluating generative models.
However, their known *stylistic biases*, such as a preference for verbosity or specific sentence structures, present an underexplored *security vulnerability*.
In this work, we introduce **BITE** (**BI**as explora**T**ion and **E**xploitation), a black-box adversarial framework that learns semantics-preserving edits to mislead the judgment and **artificially** inflate judged scores.
We cast the selection of stylistic edits as a contextual bandit problem and use a LinUCB policy to adaptively choose edits that maximize the judge’s score without access to model parameters or gradients.
Theoretically, we prove a formal regret guarantee for BITE, showing that it can efficiently learn to manipulate a judge even in the realistic setting of model misspecification.
Empirically, we test BITE across a diverse range of LLM judges and tasks, including both pointwise and pairwise comparisons on chatbot leaderboards and AI-reviewer benchmarks. BITE achieves an attack success rate \(>\!65\%\) and raises scores by \(+1\)–\(2\) on a 9-point scale, while maintaining semantic equivalence.
We further uncover model-specific "vulnerability fingerprints": judges differ in sensitivity to sentiment, register, and structural cues (e.g., headers), limiting cross-model transferability. Finally, we evaluate stealthiness and show that BITE evades standard style-control and simple detection baselines.
Our findings expose a fundamental weakness in the LLM-as-a-judge paradigm and motivate robust, attack-aware evaluation, e.g., style normalization, randomized prompting, and adversarial training of judges.
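The core mechanism described above — casting the choice of stylistic edits as a contextual bandit and selecting them with LinUCB — can be sketched as follows. This is a generic, minimal LinUCB implementation (disjoint linear models, one per arm), not the paper's actual attack code: here each arm would stand for a candidate stylistic edit, the context vector for features of the text under evaluation, and the reward for the judge's returned score; the feature dimension, `alpha`, and the toy reward model are illustrative assumptions.

```python
import numpy as np

class LinUCB:
    """Minimal LinUCB with a separate ridge-regression model per arm.

    In the BITE setting (assumed mapping): arm = candidate stylistic edit,
    context x = features of the current text, reward = judge's score.
    """

    def __init__(self, n_arms, dim, alpha=0.5):
        self.alpha = alpha                               # exploration weight
        self.A = [np.eye(dim) for _ in range(n_arms)]    # per-arm design matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # per-arm reward vectors

    def select(self, x):
        """Pick the arm with the highest upper confidence bound for context x."""
        ucbs = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                            # ridge estimate of arm's weights
            ucbs.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(ucbs))

    def update(self, arm, x, reward):
        """Fold the observed (context, reward) pair into the chosen arm's model."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Toy run with a synthetic reward model: arm 1 yields the highest expected
# "judge score" for non-negative contexts, so the policy should learn to favor it.
rng = np.random.default_rng(0)
policy = LinUCB(n_arms=3, dim=4)
true_theta = [np.zeros(4), np.array([2.0, 1.0, 0.0, 0.0]), np.full(4, 0.1)]
for _ in range(500):
    x = rng.uniform(size=4)                              # context features in [0, 1]
    arm = policy.select(x)
    reward = true_theta[arm] @ x + 0.1 * rng.normal()    # noisy simulated judge score
    policy.update(arm, x, reward)
```

In the black-box attack setting this loop needs only the judge's scalar score per query — no model parameters or gradients — which is what makes a bandit formulation a natural fit.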
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 23215