Keywords: mechanistic interpretability, representation manifolds, latent state, prompt effects, unsupervised recovery, geometry of representations, large language models
Abstract: Understanding how large language models maintain and manipulate internal task
state remains a central challenge in mechanistic interpretability.
Sequential tasks are known to induce low-dimensional structure in activation
space, yet how task-defined state, representation geometry, and prompt modulation
interact remains poorly understood.
We introduce a formal framework that links task-induced latent states to
latent-aligned manifolds in the residual stream, in which representations
concentrate near low-dimensional trajectories reflecting latent state progression.
Building on this perspective, we develop Auto-Latent, an unsupervised method
that recovers ordered latent-aligned structure directly from activations without
access to task rules or handcrafted annotations.
Across controlled state-tracking tasks, we find that system-level prompts act
primarily as near-translational offsets on task manifolds, preserving geometric
structure while shifting their embedding in representation space.
These translation-dominated responses and latent-aligned geometries persist under
unsupervised recovery, indicating that prompt modulation and internal task state
are governed by intrinsic geometric properties of model computation rather than
task-specific annotation artifacts.
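The translational-offset claim above can be illustrated with a minimal synthetic sketch. The array names, dimensions, and the variance-based consistency score below are illustrative assumptions, not the paper's actual pipeline: given activations collected per latent state with and without a system prompt, a translation-dominated prompt effect predicts that per-state mean shifts are nearly identical across states.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: residual-stream activations for K latent states,
# n samples each, in a d-dimensional stream (all names illustrative).
K, n, d = 5, 40, 64
base = rng.normal(size=(K, n, d))          # activations without a system prompt
offset = rng.normal(size=d)                # a shared translation in activation space
prompted = base + offset + 0.05 * rng.normal(size=(K, n, d))  # prompt applied

# Per-state mean shift induced by the prompt, shape (K, d).
shifts = prompted.mean(axis=1) - base.mean(axis=1)

# If the prompt acts as an approximately translational offset, the
# per-state shifts should be nearly identical across states: the residual
# around the mean shift is small relative to the shift's magnitude.
mean_shift = shifts.mean(axis=0)
residual = np.linalg.norm(shifts - mean_shift, axis=1).mean()
consistency = residual / np.linalg.norm(mean_shift)
print(f"relative shift inconsistency: {consistency:.3f}")
```

A value near zero indicates a translation-dominated prompt response; values near one would indicate state-dependent, geometry-distorting prompt effects.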
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: probing, knowledge tracing/discovering/inducing, robustness
Contribution Types: Model analysis & interpretability, Theory
Languages Studied: English
Submission Number: 1712