Introspective Coupling: LMs Explain Themselves Better Than Training Targets

Published: 27 May 2026, Last Modified: 27 May 2026
Venue: CompLearn 2026 Poster
License: CC BY 4.0
Keywords: introspection, model organisms, faithful explanation, interpretability
Abstract: When does training language models (LMs) on self-explanations yield faithful introspection rather than superficial imitation? Surprisingly, we find that LMs trained to explain the predictions of similar models frequently produce explanations that are more faithful to $\textit{their own current behaviors}$ than to those of their training targets. This ``introspective'' coupling between a model's explanations and its behaviors occurs only when the training target explanations remain sufficiently similar to the model's behaviors throughout training, even as those behaviors shift. We also show that introspective coupling tracks different $\textit{types}$ of behavioral shift: when explanation training is run concurrently with other post-training that changes the model's behaviors, the explanations track those changes without requiring updated supervision. This holds across a diverse range of tasks, including sycophancy and refusal, and is robust to label noise. These results suggest that introspection training is a viable component of post-training: explanation labels need not be continually refreshed, and faithfulness extends to regions of input space that are not explicitly supervised.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 128