Language Triggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models
Keywords: Interpretability for AI Safety, Circuit Analysis, Attribution Graphs
Other Keywords: Backdoor, trigger, activation patching
TL;DR: We show that language-switching backdoor triggers in LLMs do not create isolated circuits but instead hijack the model's language-encoding attention heads, with trigger-activated and natural language heads overlapping substantially across scales.
Abstract: Backdoor attacks pose significant security risks for Large Language Models (LLMs), yet the internal mechanisms by which triggers operate remain poorly understood. We present the first mechanistic analysis of trigger-induced language-switching backdoors injected during pre-training, studying the Gaperon model family (1B, 8B and 24B). Using activation patching, we localize trigger formation and identify which attention heads process trigger and natural language information. Our central finding is that trigger heads substantially overlap with heads naturally encoding output language across model scales, with Jaccard indices between 0.18 and 0.43 over the top 10 heads identified. This suggests that backdoor triggers do not form new circuits but instead co-opt the model's existing language components and representations. These findings have implications for backdoor defense as detection methods and mitigation strategies could leverage this entanglement between triggers and natural behaviors. More broadly, our work represents a first step toward a more realistic mechanistic understanding of pre-training-injected backdoors in LLMs, paving the way for principled, interpretability-driven defenses.
Submission Number: 425
Loading