Keywords: LLM-as-a-judge, Stylistic Bias, Model Evaluation
Abstract: Large Language Models (LLMs) are increasingly employed as automated judges for evaluating generative models.
However, their known *stylistic biases*, such as a preference for verbosity or specific sentence structures, present an underexplored *security vulnerability*.
In this work, we introduce **BITE** (**BI**as explora**T**ion and **E**xploitation), a black-box adversarial framework that learns semantics-preserving edits to mislead the judgment and **artificially** inflate judged scores.
We cast the selection of stylistic edits as a contextual bandit problem and use a LinUCB policy to adaptively choose edits that maximize the judge’s score without access to model parameters or gradients.
Theoretically, we prove a formal regret guarantee for BITE, showing that it can efficiently learn to manipulate a judge even in the realistic setting of model misspecification.
Empirically, we test BITE across a diverse range of LLM judges and tasks, including both pointwise and pairwise comparisons on chatbot leaderboards and AI-reviewer benchmarks. BITE achieves an attack success rate \(>\!65\%\) and raises scores by \(+1\)–\(2\) on a 9-point scale, while maintaining semantic equivalence.
We further uncover model-specific "vulnerability fingerprints": judges differ in sensitivity to sentiment, register, and structural cues (e.g., headers), limiting cross-model transferability. Finally, we evaluate stealthiness and show that BITE evades standard style-control and simple detection baselines.
Our findings expose a fundamental weakness in the LLM-as-a-judge paradigm and motivate robust, attack-aware evaluation, e.g., style normalization, randomized prompting, and adversarial training of judges.
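The core mechanism described above — casting the choice of stylistic edits as a contextual bandit and selecting them with LinUCB — can be sketched as follows. This is a generic, minimal LinUCB implementation (disjoint linear models, one per arm), not the paper's actual attack code: here each arm would stand for a candidate stylistic edit, the context vector for features of the text under evaluation, and the reward for the judge's returned score; the feature dimension, `alpha`, and the toy reward model are illustrative assumptions.

```python
import numpy as np

class LinUCB:
    """Minimal LinUCB with a separate ridge-regression model per arm.

    In the BITE setting (assumed mapping): arm = candidate stylistic edit,
    context x = features of the current text, reward = judge's score.
    """

    def __init__(self, n_arms, dim, alpha=0.5):
        self.alpha = alpha                               # exploration weight
        self.A = [np.eye(dim) for _ in range(n_arms)]    # per-arm design matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # per-arm reward vectors

    def select(self, x):
        """Pick the arm with the highest upper confidence bound for context x."""
        ucbs = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                            # ridge estimate of arm's weights
            ucbs.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(ucbs))

    def update(self, arm, x, reward):
        """Fold the observed (context, reward) pair into the chosen arm's model."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Toy run with a synthetic reward model: arm 1 yields the highest expected
# "judge score" for non-negative contexts, so the policy should learn to favor it.
rng = np.random.default_rng(0)
policy = LinUCB(n_arms=3, dim=4)
true_theta = [np.zeros(4), np.array([2.0, 1.0, 0.0, 0.0]), np.full(4, 0.1)]
for _ in range(500):
    x = rng.uniform(size=4)                              # context features in [0, 1]
    arm = policy.select(x)
    reward = true_theta[arm] @ x + 0.1 * rng.normal()    # noisy simulated judge score
    policy.update(arm, x, reward)
```

In the black-box attack setting this loop needs only the judge's scalar score per query — no model parameters or gradients — which is what makes a bandit formulation a natural fit.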
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 23215