The Shape of Beliefs: Geometry, Dynamics, and Interventions along Representation Manifolds of Language Models' Posteriors
Keywords: belief, posterior, mechanistic interpretability, manifolds, linear field probes
Abstract: Large language models represent prompt-conditioned beliefs (posteriors over answers and claims), but we lack a mechanistic account of how these beliefs are encoded in representation space, how they update with new evidence, and how interventions reshape them. To make these questions measurable, we study a controlled numerical setting where Llama-3.2 infers a parametric posterior predictive distribution from stochastic time series, yielding a map from hidden states to output distributions and forming curved "belief manifolds." This lets us follow belief updates as trajectories when the underlying data-generating process switches, and then test interventions that try to move the model's belief along a desired coordinate. We demonstrate that standard linear steering often pushes states off-manifold and induces coupled, out-of-distribution shifts, while geometry- and field-aware steering better preserves the intended belief family.
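The contrast the abstract draws between linear and geometry-aware steering can be illustrated with a minimal toy sketch. Everything below is an assumption for illustration only: the "belief manifold" is stood in for by the unit circle in R^2, and `linear_steer` / `geodesic_steer` are hypothetical helpers, not the paper's actual method, which operates on a language model's hidden states.

```python
import numpy as np

# Toy "belief manifold": the unit circle in R^2, states h(t) = (cos t, sin t).
# Illustrative stand-in only; the paper's manifolds live in LLM hidden-state space.

def on_manifold(h, tol=1e-6):
    """Check whether a state lies on the unit-circle manifold."""
    return abs(np.linalg.norm(h) - 1.0) < tol

def linear_steer(h, direction, alpha):
    """Standard linear steering: add a fixed direction to the state."""
    return h + alpha * direction

def geodesic_steer(h, alpha):
    """Geometry-aware steering: move along the manifold's tangent at h,
    then retract (renormalize) back onto the circle."""
    tangent = np.array([-h[1], h[0]])  # tangent direction on the circle at h
    stepped = h + alpha * tangent
    return stepped / np.linalg.norm(stepped)

h0 = np.array([1.0, 0.0])   # a state on the manifold
v = np.array([0.0, 1.0])    # a fixed steering direction

h_lin = linear_steer(h0, v, 0.5)
h_geo = geodesic_steer(h0, 0.5)

print(on_manifold(h_lin))  # False: linear steering pushed the state off-manifold
print(on_manifold(h_geo))  # True: the retraction keeps the state in the belief family
```

The sketch mirrors the abstract's claim at a cartoon level: adding a constant direction leaves the constraint surface, while stepping along the tangent and retracting preserves it.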
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: Interpretability and Analysis of Models for NLP
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 4292