An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L

Published: 24 Jun 2024, Last Modified: 31 Jul 2024, ICML 2024 MI Workshop Poster, License: CC BY 4.0
Keywords: interpretability, language models, adversarial examples, activation patching, logit attribution
TL;DR: We show an example of memory management in a transformer model that can cause direct logit attribution techniques to produce misleading interpretations.
Abstract: Prior work suggests that language models manage the limited bandwidth of the residual stream through a "memory management" mechanism, in which certain attention heads and MLP layers clear residual stream directions written by earlier layers. Our study provides concrete evidence for this erasure phenomenon in a 4-layer transformer, identifying heads that consistently remove the output of earlier heads. We further demonstrate that direct logit attribution (DLA), a common technique for interpreting the output of intermediate transformer layers, can produce misleading results because it does not account for erasure.
Submission Number: 81
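For readers unfamiliar with DLA, the sketch below illustrates the basic computation the abstract refers to: each component's direct contribution to a target logit is the dot product of its residual-stream write with that token's unembedding direction. This is a minimal illustration with hypothetical random tensors rather than GELU-4L itself; the names (component_outputs, W_U, target_token) are illustrative, not taken from the paper's code.

```python
# Minimal sketch of direct logit attribution (DLA) on hypothetical tensors.
import torch

d_model, d_vocab = 512, 50257
n_components = 10  # e.g. individual attention heads and MLP layers

# Hypothetical per-component residual-stream writes at the final position.
component_outputs = torch.randn(n_components, d_model)

# Unembedding matrix W_U maps residual-stream directions to logits.
W_U = torch.randn(d_model, d_vocab)
target_token = 42  # hypothetical id of the token whose logit we attribute

# DLA: dot each component's output with the target token's unembedding
# vector. (In practice the final layer norm is applied or folded in first.)
dla_scores = component_outputs @ W_U[:, target_token]
print(dla_scores)  # one direct-contribution score per component
```

The paper's concern is visible in this decomposition: if a later head erases an earlier head's write, the earlier head's DLA score overstates its true effect, since the erased direction never actually reaches the unembedding.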