GRPO-SoftCoT++: Latent-Space Contrastive Reinforcement Learning for Stable Multi-Step Reasoning in Large Language Models
Keywords: Large Language Models, Reinforcement Learning, Multi-step Reasoning, Chain-of-Thought, Latent Space, Contrastive Learning
Abstract: Large Language Models (LLMs) achieve strong surface-level text generation but often struggle with reliable multi-step reasoning, where behavior resembles statistical pattern matching rather than systematic deduction. Reinforcement learning (RL) introduces a promising "think-before-speak" paradigm, yet token-level RL in discrete action spaces suffers from sample inefficiency, high gradient variance, and catastrophic forgetting. We propose GRPO-SoftCoT++, a latent-space contrastive reinforcement learning framework that shifts reasoning exploration from token sequences to a continuous semantic manifold. A lightweight assistant samples multiple latent reasoning trajectories, which are evaluated by correctness- and format-based rewards and selectively decoded by a frozen main model. Group-relative policy optimization ensures stable latent-space learning, while a contrastive objective encourages diverse yet coherent reasoning paths. Experiments on GSM8K and MATH show that GRPO-SoftCoT++ improves Pass@1 accuracy by +4.3% and +7.2% over SoftCoT++, respectively, with more stable convergence under comparable computational budgets, demonstrating the effectiveness of latent-space reinforcement learning for long-horizon reasoning.
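The abstract's "group-relative policy optimization" step can be illustrated with a minimal sketch: rewards from a sampled group of latent reasoning trajectories are normalized against the group's own mean and standard deviation, so no separate value network is needed. The function name and the example reward values below are hypothetical illustrations, not taken from the paper.

```python
import math

def group_relative_advantages(rewards):
    """GRPO-style advantage estimate: normalize each trajectory's reward
    within its sampled group (reward minus group mean, divided by group
    standard deviation), instead of using a learned value baseline."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var) or 1.0  # guard: uniform rewards would give std = 0
    return [(r - mean) / std for r in rewards]

# Hypothetical example: four latent trajectories scored by a
# correctness- and format-based reward in [0, 1]
rewards = [1.0, 0.0, 1.0, 0.5]
advantages = group_relative_advantages(rewards)
```

Trajectories scoring above the group mean receive positive advantages and are reinforced; below-average ones are penalized, which is what keeps the latent-space updates low-variance without a critic.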
Paper Type: Long
Research Area: Mathematical, Symbolic, Neurosymbolic, and Logical Reasoning
Research Area Keywords: Reasoning, Reinforcement Learning, Language Models
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 2025