LSRL: Process-Supervised GRPO on Latent Recurrent States Improves Mathematical Reasoning

ACL ARR 2025 May Submission7949 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Latent-recurrent language models solve tasks by iteratively refining hidden states rather than emitting chain-of-thought tokens, yet the opacity of those hidden trajectories hinders credit assignment and limits mathematical reasoning accuracy. We propose Latent-State Supervised Reinforcement Learning (LSRL), a process-supervised variant of Group Relative Policy Optimization (GRPO) that delivers a dense reward at every latent step. Specifically, we decode the hidden state at each recurrent depth of a 3.5-billion-parameter Huginn model, score the resulting partial solutions with a GPT-4.1-nano grader aligned to final-answer correctness, and update the policy with LoRA on a single NVIDIA L40S GPU using only 500 GSM-8K training problems. Relative to the depth-8 supervised Huginn baseline, LSRL improves accuracy by 4.27 absolute points on GSM-8K and 2.06 points on MathQA. These results show that rewarding latent steps is an efficient route to stronger mathematical reasoning in latent-recurrent language models.
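
To make the reward scheme concrete, the sketch below computes group-relative advantages from dense per-latent-step rewards, in the spirit of the GRPO variant described in the abstract. It is a minimal illustration, not the authors' implementation: the group size, depth count, and reward values are hypothetical placeholders standing in for the GPT-4.1-nano grader scores on decoded partial solutions.

```python
import numpy as np

def group_relative_advantages(step_rewards: np.ndarray) -> np.ndarray:
    """GRPO-style advantages from dense per-latent-step rewards.

    step_rewards: array of shape (group_size, num_latent_steps), where each
    row is one sampled rollout and each column is the grader's reward for the
    partial solution decoded at that latent depth (placeholder values here).
    """
    # Sum the dense per-step rewards into a scalar return per rollout.
    returns = step_rewards.sum(axis=1)
    # Group-relative baseline: normalize returns within the sampled group,
    # so no learned value critic is required.
    return (returns - returns.mean()) / (returns.std() + 1e-8)

if __name__ == "__main__":
    # Hypothetical grader scores in [0, 1] for 4 rollouts over 8 latent depths.
    rng = np.random.default_rng(0)
    rewards = rng.uniform(0.0, 1.0, size=(4, 8))
    print(group_relative_advantages(rewards))
```

Normalizing within the sampled group removes the need for a learned value critic, which is the usual appeal of GRPO-style updates in low-compute settings such as the single-GPU LoRA setup described above.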
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: Reinforcement Learning, Optimization Methods, Representation Learning, Generative Models, Transfer Learning / Domain Adaptation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings / efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 7949