Local Causal Attribution of Chain-of-Thought Reasoning

Published: 11 Jun 2026, Last Modified: 20 Jun 2026
Venue: Mech Interp Workshop ICML 2026 Virtual poster
License: CC BY 4.0
Keywords: Methods (probing, steering, causal interventions), Attribution Graphs
Other Keywords: chain-of-thought, local explanation, attribution
TL;DR: We propose a black-box method for attributing chain-of-thought steps to prior steps using a structural causal model.
Abstract: Understanding the causal structure of a language model's thought process is important for both transparency and safety. In this work, we take a *local* approach toward this goal by analyzing the causal relationships among individual components, termed units, of a given, *specific* chain-of-thought trace. We construct a structural causal model on these units and relate each unit to the log probability of generating (subsequent) output units. Our algorithm, termed AttriCoT, is a black-box method that performs attribution by estimating importance parameters in the structural causal model using $O(U)$ forward passes through the model, where $U$ is the number of units. Evaluation of perturbation curves across 5 datasets and 4 reasoning models shows that AttriCoT produces attributions that are more faithful to the model's behavior than alternative methods. The attribution results also reveal notable differences in thought structure between models and domains.
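To give a feel for what black-box attribution over chain-of-thought units with $O(U)$ model calls can look like, here is a minimal sketch of a simple leave-one-out baseline. It is not the AttriCoT estimator, whose structural causal model and importance parameters are not specified in the abstract. The names `leave_one_out_attribution` and `toy_log_prob` are hypothetical, and `toy_log_prob` is a word-overlap stand-in for a real model's log probability of a target unit given the preceding units.

```python
from typing import Callable, List, Sequence


def leave_one_out_attribution(
    units: Sequence[str],
    target: str,
    log_prob: Callable[[Sequence[str], str], float],
) -> List[float]:
    """Score each unit by the drop in log p(target | units) when it is removed.

    Uses U + 1 calls to `log_prob`: one baseline plus one per ablated unit.
    """
    baseline = log_prob(units, target)
    scores = []
    for i in range(len(units)):
        # Context with unit i removed; all other units keep their order.
        ablated = list(units[:i]) + list(units[i + 1:])
        scores.append(baseline - log_prob(ablated, target))
    return scores


def toy_log_prob(context: Sequence[str], target: str) -> float:
    """Hypothetical stand-in for a model call (assumption, not a real model).

    The score rises by 1 for every context unit sharing a word with the target.
    """
    target_words = set(target.lower().split())
    overlap = sum(1 for u in context if target_words & set(u.lower().split()))
    return -5.0 + overlap


if __name__ == "__main__":
    cot_units = ["x equals 3", "the sky is blue", "so x plus 1 equals 4"]
    print(leave_one_out_attribution(cot_units, "x equals 4", toy_log_prob))
```

In this toy setting the unrelated unit receives a score of zero, while the two units that share words with the target each receive a positive score. A real use would replace `toy_log_prob` with a forward pass that returns the model's log probability of the target unit.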
Submission Number: 208