DISPO: Enhancing Training Efficiency and Stability in Reinforcement Learning for Large Language Model Mathematical Reasoning

Published: 03 Feb 2026 · Last Modified: 03 Feb 2026 · AISTATS 2026 Poster · License: CC BY 4.0
TL;DR: DISPO decouples clipping for tokens in correct vs. incorrect responses to control four distinct policy update regimes, delivering REINFORCE-like efficiency with PPO-like stability.
Abstract: Reinforcement learning with verifiable rewards has emerged as a promising paradigm for enhancing the reasoning capabilities of large language models, particularly in mathematics. Current approaches in this domain present a clear trade-off: PPO-style methods (e.g., GRPO/DAPO) offer training stability but exhibit slow learning trajectories due to their trust-region constraints on policy updates, while REINFORCE-style approaches (e.g., CISPO) demonstrate improved learning efficiency but suffer from performance instability because they clip importance sampling weights while still permitting non-zero gradients outside the trust region. To address these limitations, we introduce DISPO, a simple yet effective REINFORCE-style algorithm that decouples the up-clipping and down-clipping of importance sampling weights for correct and incorrect responses, yielding four controllable policy update regimes. Through targeted ablations, we uncover how each regime impacts training: for correct responses, weights $>1$ increase the average token entropy (i.e., exploration) while weights $<1$ decrease it (i.e., distillation); both are beneficial but cause gradual performance degradation when excessive. For incorrect responses, overly restrictive clipping triggers sudden performance collapse through repetitive outputs (when weights $>1$) or vanishing response lengths (when weights $<1$). By separately tuning these four clipping parameters, DISPO maintains the exploration-distillation balance while preventing catastrophic failures, achieving 61.04\% on AIME'24 (vs.\ 55.42\% CISPO and 50.21\% DAPO) with similar gains across various benchmarks and models.
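To make the decoupled-clipping idea concrete, below is a minimal PyTorch sketch of a REINFORCE-style per-token loss in which the importance-sampling weight is clipped with separate lower/upper bounds for tokens from correct vs. incorrect responses (identified here by the sign of the advantage). The function and parameter names (`dispo_style_loss`, `pos_low`, `pos_high`, `neg_low`, `neg_high`) and the exact objective form are illustrative assumptions, not the paper's implementation.

```python
import torch

def dispo_style_loss(logp_new, logp_old, advantages,
                     pos_low=0.8, pos_high=1.2,
                     neg_low=0.8, neg_high=1.2):
    """Illustrative sketch (not the paper's code) of decoupled clipping.

    logp_new, logp_old: per-token log-probs under the current / behavior policy.
    advantages: per-token advantages; sign > 0 marks tokens from correct
                responses, sign < 0 marks tokens from incorrect responses.
    pos_low/pos_high: down-/up-clipping bounds for correct-response tokens.
    neg_low/neg_high: down-/up-clipping bounds for incorrect-response tokens.
    """
    # REINFORCE-style: the importance-sampling weight acts as a fixed
    # coefficient, so no gradient flows through it.
    with torch.no_grad():
        ratio = torch.exp(logp_new - logp_old)
        # Select clipping bounds per token based on response correctness.
        low = torch.where(advantages > 0,
                          torch.full_like(ratio, pos_low),
                          torch.full_like(ratio, neg_low))
        high = torch.where(advantages > 0,
                           torch.full_like(ratio, pos_high),
                           torch.full_like(ratio, neg_high))
        clipped_w = torch.clamp(ratio, low, high)

    # Policy-gradient term: the clipped weight scales the log-prob gradient.
    per_token_loss = -clipped_w * advantages * logp_new
    return per_token_loss.mean()
```

Under this sketch, setting the four bounds independently controls the four regimes the abstract describes: loosening `pos_high` or tightening `pos_low` trades exploration against distillation on correct responses, while `neg_low`/`neg_high` govern how aggressively incorrect responses are suppressed.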
Submission Number: 717