GRPO-SoftCoT++: Latent-Space Contrastive Reinforcement Learning for Stable Multi-Step Reasoning in Large Language Models
Keywords: Large Language Models, Reinforcement Learning, Multi-step Reasoning, Chain-of-Thought, Latent Space, Contrastive Learning
Abstract: Large Language Models (LLMs) achieve strong surface-level text generation but often struggle with reliable multi-step reasoning, where behavior resembles statistical pattern matching rather than systematic deduction. Reinforcement learning (RL) introduces a promising "think-before-speak" paradigm, yet token-level RL in discrete action spaces suffers from sample inefficiency, high gradient variance, and catastrophic forgetting. We propose GRPO-SoftCoT++, a latent-space contrastive reinforcement learning framework that shifts reasoning exploration from token sequences to a continuous semantic manifold. A lightweight assistant samples multiple latent reasoning trajectories, which are evaluated by correctness- and format-based rewards and selectively decoded by a frozen main model. Group-relative policy optimization ensures stable latent-space learning, while a contrastive objective encourages diverse yet coherent reasoning paths. Experiments on GSM8K and MATH show that GRPO-SoftCoT++ improves Pass@1 accuracy by +4.3% and +7.2% over SoftCoT++, respectively, with more stable convergence under comparable computational budgets, demonstrating the effectiveness of latent-space reinforcement learning for long-horizon reasoning.
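The abstract's "group-relative policy optimization" step can be illustrated with a minimal sketch: rewards from a sampled group of latent reasoning trajectories are normalized against the group's own mean and standard deviation, so no separate value network is needed. The function name and the example reward values below are hypothetical illustrations, not taken from the paper.

```python
import math

def group_relative_advantages(rewards):
    """GRPO-style advantage estimate: normalize each trajectory's reward
    within its sampled group (reward minus group mean, divided by group
    standard deviation), instead of using a learned value baseline."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var) or 1.0  # guard: uniform rewards would give std = 0
    return [(r - mean) / std for r in rewards]

# Hypothetical example: four latent trajectories scored by a
# correctness- and format-based reward in [0, 1]
rewards = [1.0, 0.0, 1.0, 0.5]
advantages = group_relative_advantages(rewards)
```

Trajectories scoring above the group mean receive positive advantages and are reinforced; below-average ones are penalized, which is what keeps the latent-space updates low-variance without a critic.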
Paper Type: Long
Research Area: Mathematical, Symbolic, Neurosymbolic, and Logical Reasoning
Research Area Keywords: Reasoning, Reinforcement Learning, Language Models
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 2025