BiasTrojan: LLM Judgers Are Easily Distorted by Few Hundreds of Contrastive Biased Training Data
Keywords: Large Language Model, Deep Learning, Adversarial Attack, Cognitive Bias, Backdoor Attack, Safety Alignment
TL;DR: We propose BiasTrojan, a stealthy data poisoning attack that implants persistent cognitive biases (e.g., Authority, Bandwagon) into LLMs, causing them to systematically favor biased styles over ground truth even after extensive re-alignment.
Abstract: Large Language Models (LLMs) are increasingly deployed as automated judges to scale supervision for data curation, reinforcement learning, and agentic systems, yet the origins of their inherent biases remain largely untraced. We trace these biased tendencies to cognitively biased patterns (e.g., Authority, Bandwagon) latent in pretraining corpora, which are naturally prevalent yet entirely undetectable by existing data-cleaning pipelines. To expose the severity of this overlooked threat, we introduce BiasTrojan, a framework that concentrates and injects these naturally occurring bias patterns via context-aware bias cues, contrastive preference pairs, and counterfeited reasoning chains for efficient internalization. Experiments across six LLMs (7B to 70B) on human-preference and fact-related datasets show that a few hundred biased samples suffice to turn LLMs into biased evaluators, with biases generalizing out-of-domain and persisting under massive continual post-training. These findings reveal that biased LLM judges threaten downstream RLHF, synthetic data curation, and agentic verification pipelines, underscoring the urgent need for bias-aware auditing of LLM training data.
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 171