Abstract: Autoregressive language models are vulnerable to adversarial attacks, yet their underlying mechanistic behavior under such perturbations remains unexplored. We propose a systematic approach for analyzing the adversarial robustness of these models, focusing on TextBugger attacks across three mechanistic tasks (IOI, CO, CC). Our study introduces methods for assessing the influence of adversarial perturbations on circuits and reveals characteristic activation patterns. We show that circuit-informed attacks can be more effective than random perturbations, highlighting the potential of circuit knowledge for designing adversarial attacks.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: LLM, Mechanistic Interpretability, Adversarial Examples
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 4170