Can you steer models towards Introspection?
Keywords: Introspection, Language models, Sparse autoencoders (SAEs), Difference-in-means vectors, Principal Component Analysis (PCA), Activation steering
TL;DR: Lightweight probing methods like PCA and difference-in-means vectors are insufficient to reliably replicate or scale introspection in language models
Abstract: Recent work has demonstrated that language models possess introspective capabilities
and that the underlying mechanisms are largely explained by a single principal component,
identified via sparse autoencoders (SAEs) and fine-tuning \cite{macar2026mechanismsintrospectiveawareness}.
In this work, we investigate whether cheaper and faster alternatives can replicate these
findings, employing difference-in-means vectors and PCA as lightweight probing methods,
alongside activation steering as a low-cost intervention to improve introspection rates.
While we expect such methods to be noisy, they offer a simple and accessible entry point
into the mechanistic study of introspection. Our code is available on \href{https://github.com/ChamodKalupahana/CS2881-Introspection}{GitHub}.
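To make the three lightweight techniques named above concrete, the following is a minimal sketch on synthetic data. It is not the authors' implementation: the function names, the toy activation matrices `pos` and `neg`, the hidden size `d`, and the steering strength `alpha` are all assumptions chosen for illustration. Real activations would be read from a model's residual stream rather than sampled from a Gaussian.

```python
import numpy as np


def difference_in_means(pos: np.ndarray, neg: np.ndarray) -> np.ndarray:
    """Unit vector pointing from the mean of `neg` to the mean of `pos`."""
    v = pos.mean(axis=0) - neg.mean(axis=0)
    return v / np.linalg.norm(v)


def top_principal_component(acts: np.ndarray) -> np.ndarray:
    """First principal component (unit norm) of mean-centered activations."""
    centered = acts - acts.mean(axis=0)
    # Rows of vt are the principal directions, sorted by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Activation steering: shift every hidden state by alpha * direction."""
    return hidden + alpha * direction


# Synthetic stand-in for activations (assumption: 64-dim hidden states,
# 200 examples per class, classes separated along the first axis).
rng = np.random.default_rng(0)
d = 64
true_dir = np.zeros(d)
true_dir[0] = 1.0
neg = rng.normal(size=(200, d))
pos = rng.normal(size=(200, d)) + 4.0 * true_dir

dim_vec = difference_in_means(pos, neg)
pc1 = top_principal_component(np.vstack([pos, neg]))

# PCA directions have arbitrary sign, so compare with absolute cosine.
cosine = abs(float(dim_vec @ pc1))

# Push the "negative" activations along the difference-in-means direction.
steered = steer(neg, dim_vec, alpha=4.0)
shift = float((steered @ dim_vec).mean() - (neg @ dim_vec).mean())
```

In this toy setting the two directions nearly coincide because a single axis carries most of the between-class variance. On real model activations that agreement is not guaranteed, which is consistent with the expectation stated above that such methods are noisy.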
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 10