Abstract: Autoregressive language models are vulnerable to adversarial attacks, yet their underlying mechanistic behavior under such perturbations remains unexplored. We propose a systematic approach for analyzing the adversarial robustness of these models, focusing on TextBugger attacks across three mechanistic tasks (IOI, CO, CC). Our study introduces methods for assessing the influence of adversarial perturbations on circuits and reveals characteristic activation patterns. We show that circuit-informed attacks can be more effective than random perturbations, highlighting the potential of circuit knowledge for designing adversarial attacks.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: LLM, Mechanistic Interpretability, Adversarial Examples
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 4170