Can you steer models towards Introspection?

Published: 10 May 2026, Last Modified: 10 May 2026 | XTAI-2026 Oral | CC BY 4.0
Keywords: Introspection, Language models, Sparse autoencoders (SAEs), Difference-in-means vectors, Principal Component Analysis (PCA), Activation steering
TL;DR: Lightweight probing methods like PCA and difference-in-means vectors are insufficient to reliably replicate or scale introspection in language models
Abstract: Recent work has demonstrated that language models possess introspective capabilities, with the underlying mechanisms shown to be largely explained by a single principal component, identified via sparse autoencoders (SAEs) and fine-tuning \cite{macar2026mechanismsintrospectiveawareness}. In this work, we investigate whether cheaper and faster alternatives can replicate these findings. We use difference-in-means vectors and PCA as lightweight probing methods, and activation steering as a low-cost intervention to improve introspection rates. While we expect such methods to be noisy, they offer a simple and accessible entry point into the mechanistic study of introspection. Our code is available on \href{https://github.com/ChamodKalupahana/CS2881-Introspection}{GitHub}.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 10
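
The three lightweight techniques named in the abstract (difference-in-means vectors, PCA, and activation steering) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the synthetic Gaussian "activations", the planted direction, and the steering strength `alpha` are all assumptions made for illustration. In practice the arrays would hold hidden-state activations collected from a language model at a chosen layer.

```python
import numpy as np


def diff_in_means(pos_acts, neg_acts):
    """Unit-norm difference between the mean activations of two classes."""
    v = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return v / np.linalg.norm(v)


def top_principal_component(acts):
    """First principal component of the mean-centered activations (sign is arbitrary)."""
    centered = acts - acts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def steer(hidden, direction, alpha):
    """Activation steering: add a scaled direction to every hidden state."""
    return hidden + alpha * direction


# Synthetic stand-in for model activations (assumption: 16-dim states,
# with the "positive" class shifted along a single planted direction).
rng = np.random.default_rng(0)
d = 16
true_dir = np.zeros(d)
true_dir[0] = 1.0
neg = rng.normal(size=(200, d))
pos = rng.normal(size=(200, d)) + 4.0 * true_dir

dim_vec = diff_in_means(pos, neg)
pc1 = top_principal_component(np.vstack([pos, neg]))

# Cosine similarity of each recovered direction with the planted one.
cos_dim = float(dim_vec @ true_dir)
cos_pc1 = float(abs(pc1 @ true_dir))

# Steering the negative class along the difference-in-means direction
# shifts its mean projection onto that direction by alpha.
steered = steer(neg, dim_vec, alpha=4.0)
shift = float((steered @ dim_vec).mean() - (neg @ dim_vec).mean())
```

On this toy data both directions align closely with the planted one because the class separation dominates the noise. On real activations the two can disagree, since PCA finds the direction of largest variance whether or not it relates to the contrast of interest.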