The Shape of Beliefs: Geometry, Dynamics, and Interventions along Representation Manifolds of Language Models' Posteriors
Keywords: belief, posterior, mechanistic interpretability, manifolds, linear field probes
Abstract: Large language models represent prompt-conditioned beliefs (posteriors over answers and claims), but we lack a mechanistic account of how these beliefs are encoded in representation space, how they update with new evidence, and how interventions reshape them. To make these questions measurable, we study a controlled numerical setting where Llama-3.2 infers a parametric posterior predictive distribution from stochastic time series, yielding a map from hidden states to output distributions and forming curved "belief manifolds." This lets us follow belief updates as trajectories when the underlying data-generating process switches, and then test interventions that try to move the model's belief along a desired coordinate. We demonstrate that standard linear steering often pushes states off-manifold and induces coupled, out-of-distribution shifts, while geometry- and field-aware steering better preserves the intended belief family.
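The contrast the abstract draws between linear and geometry-aware steering can be illustrated with a minimal toy sketch. Everything below is an assumption for illustration only: the "belief manifold" is stood in for by the unit circle in R^2, and `linear_steer` / `geodesic_steer` are hypothetical helpers, not the paper's actual method, which operates on a language model's hidden states.

```python
import numpy as np

# Toy "belief manifold": the unit circle in R^2, states h(t) = (cos t, sin t).
# Illustrative stand-in only; the paper's manifolds live in LLM hidden-state space.

def on_manifold(h, tol=1e-6):
    """Check whether a state lies on the unit-circle manifold."""
    return abs(np.linalg.norm(h) - 1.0) < tol

def linear_steer(h, direction, alpha):
    """Standard linear steering: add a fixed direction to the state."""
    return h + alpha * direction

def geodesic_steer(h, alpha):
    """Geometry-aware steering: move along the manifold's tangent at h,
    then retract (renormalize) back onto the circle."""
    tangent = np.array([-h[1], h[0]])  # tangent direction on the circle at h
    stepped = h + alpha * tangent
    return stepped / np.linalg.norm(stepped)

h0 = np.array([1.0, 0.0])   # a state on the manifold
v = np.array([0.0, 1.0])    # a fixed steering direction

h_lin = linear_steer(h0, v, 0.5)
h_geo = geodesic_steer(h0, 0.5)

print(on_manifold(h_lin))  # False: linear steering pushed the state off-manifold
print(on_manifold(h_geo))  # True: the retraction keeps the state in the belief family
```

The sketch mirrors the abstract's claim at a cartoon level: adding a constant direction leaves the constraint surface, while stepping along the tangent and retracting preserves it.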
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: Interpretability and Analysis of Models for NLP
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 4292