Abstract: Despite the remarkable success of large foundation models across a range of tasks, they remain susceptible to security threats such as backdoor attacks. By injecting poisoned data containing specific triggers during training, adversaries can manipulate model predictions in a targeted manner. While prior work has focused on empirically designing and evaluating such attacks, a rigorous theoretical understanding of when and why they succeed is lacking. In this work, we analyze backdoor attacks that exploit the token selection process within attention mechanisms, a core component of transformer-based architectures. We show that single-head self-attention transformers trained via gradient descent can interpolate poisoned training data. Moreover, we prove that when the backdoor triggers are sufficiently strong but not overly dominant, attackers can successfully manipulate model predictions. Our analysis characterizes how adversaries manipulate token selection to alter outputs and identifies the theoretical conditions under which these attacks succeed. We validate our findings through experiments on synthetic datasets.
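To make the token-selection mechanism in the abstract concrete, the following is a minimal, self-contained sketch in PyTorch. It is an illustrative toy rather than the paper's actual model, data distribution, or training regime: the single-head attention classifier, the Gaussian synthetic task, the fixed target-class relabeling, and all hyperparameters (`d`, `T`, `trigger_strength`, `poison_frac`, the step size) are assumptions chosen only to visualize how a trigger token can capture the attention weights and redirect predictions at test time.

```python
# Illustrative sketch only: a toy single-head softmax-attention classifier
# trained by plain gradient descent on synthetic data, a small fraction of
# which carries a fixed "trigger" token and a poisoned target label.
# All names and hyperparameters below are assumptions, not the paper's setup.
import torch

torch.manual_seed(0)
d, T, n, steps, lr = 16, 8, 512, 400, 0.5   # dims, tokens, samples, GD steps, step size
trigger_strength, poison_frac = 3.0, 0.1    # assumed trigger scale / poisoning rate

# Clean synthetic task: tokens are Gaussian; the label is the sign of a fixed
# linear functional of the average token.
w_star = torch.randn(d)
w_star = w_star / w_star.norm()
X = torch.randn(n, T, d)
y = torch.sign(X.mean(dim=1) @ w_star)

# Backdoor poisoning: overwrite the last token of a few examples with a fixed
# trigger vector and relabel them to the attacker's target class (-1).
trigger = trigger_strength * torch.randn(d) / d ** 0.5
poison = torch.rand(n) < poison_frac
X[poison, -1, :] = trigger
y[poison] = -1.0

# Toy single-head attention model:
#   a_t = softmax_t(x_t^T q),   f(X) = (sum_t a_t x_t)^T v
q = (0.01 * torch.randn(d)).requires_grad_()
v = (0.01 * torch.randn(d)).requires_grad_()

def forward(X):
    attn = torch.softmax(X @ q, dim=1)           # (n, T) token-selection weights
    ctx = (attn.unsqueeze(-1) * X).sum(dim=1)    # attended mixture of tokens
    return ctx @ v, attn

# Plain gradient descent on the logistic loss over the poisoned training set.
for _ in range(steps):
    logits, _ = forward(X)
    loss = torch.nn.functional.softplus(-y * logits).mean()
    loss.backward()
    with torch.no_grad():
        q -= lr * q.grad
        v -= lr * v.grad
    q.grad = None
    v.grad = None

# Test-time behavior: if the backdoor was learned, appending the trigger to
# clean inputs pulls attention onto the trigger token and pushes predictions
# toward the target class.
X_test = torch.randn(n, T, d)
y_test = torch.sign(X_test.mean(dim=1) @ w_star)
X_trig = X_test.clone()
X_trig[:, -1, :] = trigger

with torch.no_grad():
    clean_pred = torch.sign(forward(X_test)[0])
    trig_logits, trig_attn = forward(X_trig)
    trig_pred = torch.sign(trig_logits)

print(f"clean accuracy:                  {(clean_pred == y_test).float().mean().item():.2f}")
print(f"accuracy with trigger appended:  {(trig_pred == y_test).float().mean().item():.2f}")
print(f"mean attention on trigger token: {trig_attn[:, -1].mean().item():.2f}")
```

In this toy setup, a successful run would show high clean accuracy, a marked drop in accuracy once the trigger is appended, and attention mass concentrating on the trigger token; the paper's analysis characterizes, for gradient-descent-trained single-head transformers, the conditions on trigger strength under which such dynamics provably arise.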
Lay Summary: Large foundation models have achieved remarkable success, but they remain vulnerable to backdoor attacks. In these attacks, adversaries inject poisoned data during training to secretly manipulate the model’s behavior. The poisoned data contains a special pattern or trigger, like a specific word, image patch, or other signal. After training, the model performs as expected most of the time. However, when the trigger appears in new input data, the model behaves differently and produces incorrect predictions, just as the attacker intended. Understanding these vulnerabilities is crucial because it helps researchers and practitioners identify potential weaknesses in modern AI systems.
While many studies have focused on designing novel backdoor attacks, there is limited understanding of why and when these attacks are effective. To address this gap, we examined how these attacks exploit the attention mechanism—a key part of transformer models that helps decide which words or data points are most important. We discovered that attackers can trick transformers into giving special attention to certain patterns, altering predictions when those patterns appear. Our research shows that even simple transformer models can learn to respond to these hidden triggers during training, especially if the trigger is strong enough to be noticed but not so obvious that it dominates the data.
Primary Area: Deep Learning->Robustness
Keywords: Backdoor attacks, attention mechanism, token selection, transformer models, gradient descent dynamics, theoretical analysis, adversarial machine learning, label poisoning
Submission Number: 1926