Keywords: Large Language Models, Reasoning, Mechanistic Interpretability, Policy Gradient, Sparse Autoencoder
TL;DR: A method to causally control and interpret LLM reasoning behaviors by identifying and intervening on internal reasoning-critical components.
Abstract: Large language models (LLMs) demonstrate strong reasoning abilities in solving complex real-world problems. Yet, the internal mechanisms that support these behaviors remain opaque, raising concerns regarding truthfulness, safety, and controllability in practical applications.
Existing interpretability approaches either rely on human-annotated contrastive pairs to derive control vectors, which limits reliability and generalization, or identify neurons correlated with superficial textual concepts, failing to capture the complexity of reasoning processes.
Consequently, current methods struggle to precisely localize complex reasoning mechanisms or to capture causal effects from a model's internal workings to its reasoning outputs.
In this paper, we build on causality-aware and outcome-oriented principles, focusing on identifying components that causally contribute to reasoning behavior, where outcomes accumulate through long-range effects.
We propose Integrated Policy Gradient (IPG), a novel framework that attributes reasoning behaviors to internal components such as neurons by propagating compound outcome-based signals (e.g., post-reasoning accuracy) backward through model inference trajectories.
IPG is efficient, requiring only a few calls to the standard gradient operator; it uncovers the causal structures governing complex reasoning and avoids heavy manual supervision.
Empirical evaluations demonstrate that our approach achieves more precise mechanistic interpretability and enables reliable modulation of reasoning behaviors across diverse reasoning models.
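To make the idea of propagating an outcome-based signal backward to internal components concrete, the sketch below illustrates an integrated-gradients-style attribution of a reward-weighted log-probability (a policy-gradient surrogate) to hidden units. It is a minimal, hypothetical toy example, not the paper's released code or exact estimator: the `ToyModel`, the scaling-path construction, and the `reward` scalar (standing in for post-reasoning accuracy) are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: attribute an outcome-based (policy-gradient) signal to
# hidden units by integrating gradients along a scaling path. The paper's
# actual IPG estimator may differ; names here are illustrative only.

class ToyModel(nn.Module):
    def __init__(self, d_in=16, d_hidden=32, d_out=4):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.head = nn.Linear(d_hidden, d_out)

    def forward(self, x, hidden_scale=None):
        h = torch.relu(self.enc(x))
        if hidden_scale is not None:   # scale hidden units along the path
            h = h * hidden_scale
        return self.head(h)

def integrated_policy_gradient(model, x, action, reward, steps=16):
    """Integrate gradients of reward * log pi(action | x) with respect to
    per-unit scaling factors, yielding one attribution score per hidden unit."""
    d_hidden = model.enc.out_features
    total = torch.zeros(d_hidden)
    for alpha in torch.linspace(1.0 / steps, 1.0, steps):
        scale = torch.full((d_hidden,), float(alpha), requires_grad=True)
        logits = model(x, hidden_scale=scale)
        logp = torch.log_softmax(logits, dim=-1)[..., action].sum()
        surrogate = reward * logp       # compound outcome-based signal
        (grad,) = torch.autograd.grad(surrogate, scale)
        total += grad / steps
    return total                        # per-unit attribution scores

if __name__ == "__main__":
    torch.manual_seed(0)
    model = ToyModel()
    x = torch.randn(1, 16)
    scores = integrated_policy_gradient(model, x, action=2, reward=1.0)
    print(scores.topk(5).indices)       # candidate "reasoning-critical" units
```

As in the abstract, the cost is a handful of standard gradient calls (one per path step), and the resulting scores could be thresholded to select components for intervention.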
Primary Area: interpretability and explainable AI
Submission Number: 15897