Explorations of Self-Repair in Language Models

Cody Rushing; Neel Nanda

Published: 04 Mar 2024, Last Modified: 14 Apr 2024. Venue: SeT LLM @ ICLR 2024. License: CC BY 4.0

Keywords: Mechanistic Interpretability, Self-Repair, Interpretability, Large Language Models

TL;DR: Self-repair is a phenomenon that occurs imperfectly when ablating attention heads, and is partially driven by LayerNorm changes.

Abstract: Prior interpretability research studying narrow distributions has preliminarily identified self-repair, a phenomenon in which, when components in large language models are ablated, later components change their behavior to compensate. Our work builds on this literature, demonstrating that self-repair exists across a variety of model families and sizes when ablating individual attention heads on the full training distribution. We further show that on the full training distribution self-repair is imperfect, as the original direct effect of the head is not fully restored, and noisy, since the degree of self-repair varies significantly across prompts (sometimes overcorrecting beyond the original effect). We explore how the final LayerNorm scaling factor can contribute to self-repair, and discuss the implications of these results for interpretability practitioners.
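The terms "imperfect" and "overcorrecting" in the abstract can be made concrete with a small accounting sketch. This is only an illustration under simplifying assumptions, not necessarily the metric used in the paper: it assumes that, absent any compensation, ablating a head lowers the correct-token logit by exactly the head's direct effect, and it treats any remaining gap as self-repair. The function names `self_repair` and `describe_repair` and all numeric values are hypothetical.

```python
def self_repair(clean_logit: float, ablated_logit: float, direct_effect: float) -> float:
    """Logit recovered by later components after a head is ablated.

    Assumption (illustrative only): with no compensation, the ablated
    logit would equal clean_logit - direct_effect.
    """
    expected_without_repair = clean_logit - direct_effect
    return ablated_logit - expected_without_repair


def describe_repair(clean_logit: float, ablated_logit: float, direct_effect: float) -> str:
    """Label one prompt by the fraction of the direct effect that was restored."""
    if direct_effect == 0:
        raise ValueError("direct_effect must be non-zero to compute a fraction")
    fraction = self_repair(clean_logit, ablated_logit, direct_effect) / direct_effect
    if fraction <= 0:
        return "none"            # later components did not compensate
    if fraction < 1:
        return "imperfect"       # only part of the direct effect came back
    if fraction == 1:
        return "full"            # the direct effect was exactly restored
    return "overcorrecting"      # more than the original effect was restored


# Hypothetical per-prompt values: (clean logit, ablated logit, head's direct effect).
prompts = [(4.0, 2.0, 2.0), (4.0, 3.0, 2.0), (4.0, 4.5, 2.0)]
labels = [describe_repair(*p) for p in prompts]
```

Applying this per prompt rather than on an average is what would expose the noise the abstract describes, since the restored fraction can differ widely from one prompt to the next.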

Submission Number: 44
