Weight-Space Geometry of Offline Reasoning Training

Published: 11 Jun 2026, Last Modified: 11 Jun 2026
Venue: Mech Interp Workshop ICML 2026 Poster
Readers: Everyone
License: CC BY 4.0
Keywords: Feature Geometry, Applications of interpretability, Methods (probing, steering, causal interventions)
TL;DR: Six offline reasoning losses trained on identical data: SFT, RFT, and RIFT produce essentially the same weight-update direction (cosine ≥ 0.97), GRPO adds a small orthogonal component, and DPO is mechanistically a different algorithm. Most 'offline RL' is SFT in disguise.
Abstract: We train six offline reasoning losses on the same data and the same model, with everything else held fixed, and look at what they actually do to the weights. SFT, RFT, and RIFT learn the same update direction (cosine ≥ 0.97): filtering or reward-weighting negatives doesn't change the direction, just the step size. DFT diverges more despite being a one-line fix. GRPO adds a real orthogonal component but stays in the same basin. DPO is a genuinely different algorithm (orthogonal subspace, loss barrier, CKA collapse) and the only one that is meaningfully more accurate (93.5% vs 87–88% on GSM8K; 30% vs 3–10% on AIME26). Most "offline RL for reasoning" is SFT in disguise.
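To make the "same direction, different step size" and "orthogonal component" claims concrete, here is a minimal sketch of how two weight updates can be compared. It is an illustration on toy numbers, not the authors' code: the helper names (`flatten_delta`, `cosine`, `split_along`) and the tiny parameter dictionaries are hypothetical, and real checkpoints would be flattened tensors rather than short Python lists.

```python
import math

def flatten_delta(base, tuned):
    """Concatenate (tuned - base) over all parameters into one flat list."""
    delta = []
    for name in sorted(base):  # fixed order so two deltas line up
        delta.extend(t - b for b, t in zip(base[name], tuned[name]))
    return delta

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity between two flat update vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def split_along(u, ref):
    """Split u into its projection onto ref and the orthogonal remainder."""
    scale = dot(u, ref) / dot(ref, ref)
    parallel = [scale * r for r in ref]
    orthogonal = [a - p for a, p in zip(u, parallel)]
    return parallel, orthogonal

# Toy "checkpoints" (hypothetical values, chosen only for illustration).
base = {"w": [0.0, 0.0, 0.0], "b": [1.0]}
run_a = {"w": [1.0, 2.0, 0.0], "b": [1.0]}  # reference update
run_b = {"w": [2.0, 4.0, 0.0], "b": [1.0]}  # same direction, larger step
run_c = {"w": [1.0, 2.0, 1.0], "b": [1.0]}  # adds an orthogonal component

d_a = flatten_delta(base, run_a)
d_b = flatten_delta(base, run_b)
d_c = flatten_delta(base, run_c)

print("cos(a, b):", cosine(d_a, d_b))
print("cos(a, c):", cosine(d_a, d_c))

parallel, orthogonal = split_along(d_c, d_a)
print("orthogonal part of c relative to a:", orthogonal)
```

In this toy setup, `run_b` differs from `run_a` only in step size, so their cosine is 1, while `run_c` has a lower cosine and a nonzero remainder after projecting onto `d_a`. The abstract's other diagnostics (loss barrier, CKA) are not covered by this sketch.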
Submission Number: 414