Invariant Reasoning Directions in Latent Trajectories of Language Models

Arun Vignesh Malarkkan; Manan Roy Choudhury; Utkarsh Byahut; Yash Ravindra Charde; Vivek Gupta; Yanjie Fu

Published: 11 Jun 2026, Last Modified: 11 Jun 2026; Mech Interp Workshop ICML 2026 Poster; CC BY 4.0

Keywords: Methods (probing, steering, causal interventions), Feature Geometry, Other

Other Keywords: Latent Reasoning Geometry

TL;DR: Latent reasoning trajectories contain stable low-dimensional directions associated with transferable reasoning behavior. TILR identifies and intervenes on these invariant subspaces, reducing paraphrase instability.

Abstract: Latent reasoning models perform multi-step inference directly in hidden-state space, yet the structure of these latent reasoning trajectories remains poorly understood. We show that contrastive refinement signals between stronger and weaker reasoning trajectories exhibit a highly concentrated low-rank structure, while unconstrained latent updates remain sensitive to paraphrases, checkpoint choice, and trajectory perturbations. These observations suggest that latent reasoning trajectories contain stable invariant directions mixed with unstable instance-specific variation. We introduce Trajectory-Invariant Latent Refinement (TILR), a training-free intervention framework for identifying and manipulating stable reasoning directions in latent space. TILR first learns a low-rank invariant subspace from contrastive trajectory differences across inputs, then constrains latent interventions to this subspace while suppressing poorly aligned updates through an adaptive alignment gate. Across six reasoning benchmarks, we find that a small number of latent directions explain most variation between strong and weak reasoning trajectories. Interventions on these directions causally improve reasoning consistency and reduce trajectory instability under paraphrases and perturbations. TILR improves answer consistency under paraphrase by ~10% and reduces latent trajectory variance by up to 50% while preserving reasoning accuracy. These results support a geometric view of latent reasoning in which transferable reasoning behavior emerges from stable low-dimensional structure within hidden-state trajectories.
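The abstract describes two steps: learn a low-rank subspace from contrastive (strong minus weak) trajectory differences, then project latent updates onto it and gate away poorly aligned ones. The sketch below illustrates that idea under stated assumptions and is not the authors' TILR implementation. It assumes the invariant subspace is spanned by the top-k right singular vectors of the stacked, uncentered difference vectors, and that the "adaptive alignment gate" is a sigmoid of the fraction of an update's norm lying in that subspace. The function names `fit_invariant_subspace` and `gated_projected_update`, the `threshold`/`sharpness` parameters, and the synthetic data are all hypothetical.

```python
import numpy as np


def fit_invariant_subspace(strong, weak, rank):
    """Hypothetical: estimate a rank-k basis from contrastive differences.

    strong, weak: (n_inputs, d) hidden states from stronger / weaker trajectories.
    Returns an orthonormal basis (d, rank) and the fraction of squared
    singular-value energy carried by each direction.
    """
    diffs = strong - weak  # contrastive refinement signals, one row per input
    # Uncentered SVD (assumption): a direction shared by all inputs is kept, not subtracted out.
    _, s, vt = np.linalg.svd(diffs, full_matrices=False)
    energy = s**2 / np.sum(s**2)
    return vt[:rank].T, energy


def gated_projected_update(h, delta, basis, threshold=0.5, sharpness=10.0):
    """Hypothetical: constrain an update to the subspace and gate it by alignment."""
    proj = basis @ (basis.T @ delta)  # component of delta inside the subspace
    alignment = np.linalg.norm(proj) / (np.linalg.norm(delta) + 1e-12)  # in [0, 1]
    gate = 1.0 / (1.0 + np.exp(-sharpness * (alignment - threshold)))  # soft gate (assumed form)
    return h + gate * proj, alignment, gate


# Synthetic demo: differences are driven by 2 hidden directions plus small noise.
rng = np.random.default_rng(0)
d, n = 64, 200
true_dirs = np.linalg.qr(rng.standard_normal((d, 2)))[0]  # (d, 2), orthonormal columns
weak = rng.standard_normal((n, d))
strong = weak + (3.0 * rng.standard_normal((n, 2))) @ true_dirs.T + 0.1 * rng.standard_normal((n, d))

basis, energy = fit_invariant_subspace(strong, weak, rank=2)
print("energy in top-2 directions:", energy[:2].sum())

h = rng.standard_normal(d)
aligned_delta = basis @ np.array([1.0, -0.5])  # lies entirely in the subspace
orth_delta = rng.standard_normal(d)
orth_delta -= basis @ (basis.T @ orth_delta)  # remove its in-subspace part

_, align_aligned, gate_aligned = gated_projected_update(h, aligned_delta, basis)
_, align_orth, gate_orth = gated_projected_update(h, orth_delta, basis)
print("aligned gate:", gate_aligned, "orthogonal gate:", gate_orth)
```

In this toy setup, almost all of the contrastive energy sits in two directions, mirroring the paper's observation that a few directions explain most of the strong/weak variation. An update that lies inside the subspace passes nearly unchanged, while an orthogonal update is almost fully suppressed. The sigmoid gate is one simple way to make the suppression soft rather than a hard cutoff.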

Submission Number: 526
