Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms

Published: 24 Jun 2024, Last Modified: 14 Jul 2024
Venue: ICML 2024 MI Workshop Spotlight
License: CC BY 4.0
Keywords: circuits, faithfulness
TL;DR: We introduce a new automated circuit-finding method, and show that it is more faithful than its predecessor.
Abstract: Many recent language model (LM) interpretability studies have adopted the circuits framework, which aims to find the minimal computational subgraph, or circuit, that explains LM behavior on a given task. Most studies determine which edges belong in an LM's circuit for a task by performing causal interventions on each edge independently, but this scales poorly with model size. As a solution, recent work has proposed edge attribution patching (EAP), a scalable gradient-based approximation to interventions. In this paper, we introduce a new method, EAP with integrated gradients (EAP-IG), that aims to efficiently find circuits while better maintaining one of their core properties: faithfulness. A circuit is faithful if all model edges outside the circuit can be ablated without changing the model's behavior on the task; faithfulness is what justifies studying circuits rather than the full model. Our experiments demonstrate that circuits found using EAP-IG are more faithful than those found using EAP, even though both have high node overlap with reference circuits found using causal interventions. We conclude more generally that when comparing circuits, measuring overlap is no substitute for measuring faithfulness.
Submission Number: 114
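To make the abstract's contrast concrete, the following is a minimal toy sketch, not the paper's implementation. It compares three quantities for a single activation: the exact effect of a causal intervention, a one-gradient EAP-style estimate, and an EAP-IG-style estimate that averages gradients along a straight path. The tanh-based `metric`, the helper names, the choice of path from clean to corrupted activation, and the default `steps` value are all illustrative assumptions.

```python
import math

def metric(z):
    """Hypothetical task metric as a function of one activation vector."""
    return sum(math.tanh(x) for x in z)

def metric_grad(z):
    """Analytic gradient of the toy metric with respect to z."""
    return [1.0 - math.tanh(x) ** 2 for x in z]

def patching_effect(z_clean, z_corrupt):
    """Exact intervention: swap the clean activation for the corrupted one."""
    return metric(z_corrupt) - metric(z_clean)

def eap_score(z_clean, z_corrupt):
    """First-order estimate using one gradient, taken at the clean activation."""
    grad = metric_grad(z_clean)
    return sum((c - x) * g for x, c, g in zip(z_clean, z_corrupt, grad))

def eap_ig_score(z_clean, z_corrupt, steps=50):
    """Estimate using the gradient averaged over `steps` points on the
    straight line from z_clean to z_corrupt (a Riemann-sum approximation
    of integrated gradients)."""
    diff = [c - x for x, c in zip(z_clean, z_corrupt)]
    avg_grad = [0.0] * len(z_clean)
    for k in range(1, steps + 1):
        alpha = k / steps
        point = [x + alpha * d for x, d in zip(z_clean, diff)]
        for i, g in enumerate(metric_grad(point)):
            avg_grad[i] += g / steps
    return sum(d * g for d, g in zip(diff, avg_grad))

# A saturated clean activation: the gradient there is close to zero, so the
# single-gradient estimate misses most of the true effect.
z_clean, z_corrupt = [4.0], [-4.0]
exact = patching_effect(z_clean, z_corrupt)
single = eap_score(z_clean, z_corrupt)
integrated = eap_ig_score(z_clean, z_corrupt)
print(f"exact={exact:.4f}  eap={single:.4f}  eap_ig={integrated:.4f}")
```

In this toy setting the path-averaged estimate tracks the exact intervention effect closely, while the single-gradient estimate is near zero because the metric is flat at the clean activation. Whether the same gap appears in a real model depends on the model, task, and metric.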