Cascading Adversarial Bias from Injection to Distillation in Language Models

Published: 10 Jun 2025, Last Modified: 13 Jul 2025
Venue: DIG-BUG Long
License: CC BY 4.0
Keywords: data poisoning, adversarial bias, model distillation
Abstract: Model distillation has become an essential technique for creating smaller, deployable language models that retain the capabilities of larger systems. However, the widespread deployment of these distilled models raises growing concerns about their resilience to adversarial manipulation. This paper investigates the vulnerability of distilled language models to adversarial injection of biased content during training. Specifically, we demonstrate that an adversary can inject subtle biases into a teacher model through minimal data poisoning during training, and that these biases not only propagate to the distilled student model but also become significantly amplified. We characterize two distinct modes of propagation: Untargeted Propagation, where the adversarial bias affects multiple tasks, and Targeted Propagation, where it focuses on a specific task while normal behavior is maintained elsewhere. We test our attack across six bias types (including targeted advertisements, phishing links, narrative manipulations, and insecure coding practices), various distillation methods, and data modalities spanning both text and code generation. Our evaluation reveals that current defense mechanisms, including perplexity filtering, bias detection systems, and LLM-based autorater frameworks, fall short against these attacks. These results expose significant security and trustworthiness vulnerabilities in distilled language models and highlight an urgent need for specialized safeguards. To address this previously unexamined threat vector, we propose practical design principles that can serve as effective adversarial bias mitigation strategies in the future.
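The abstract reports that perplexity filtering is among the defenses that fall short against this attack. As a rough illustration of what such a filter typically looks like (not code from the paper), the sketch below scores candidate training examples with a small reference language model and drops those whose perplexity exceeds a cutoff; the reference model name, threshold value, and helper names are assumptions made here for illustration.

```python
# Illustrative sketch of a perplexity-filtering defense (assumed setup, not the
# paper's implementation): score each candidate training example with a small
# reference LM and discard examples whose perplexity is above a threshold.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumed reference model used only for scoring
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def perplexity(text: str) -> float:
    """Perplexity of `text` under the reference LM (lower = more fluent)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return math.exp(loss.item())


def filter_training_data(examples: list[str], threshold: float = 80.0) -> list[str]:
    """Keep only examples scoring below the (assumed) perplexity threshold."""
    return [ex for ex in examples if perplexity(ex) < threshold]


# Subtly biased poison text (e.g., a fluent sentence containing a planted link)
# can score close to clean text, which is consistent with the abstract's finding
# that perplexity filtering alone does not stop these attacks.
```

The design point this sketch makes concrete is that perplexity-based filtering only rejects text the reference model finds unnatural; adversarial bias written as fluent prose passes through, which is why the paper evaluates additional defenses such as bias detectors and LLM-based autoraters.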
Submission Number: 8