Early Preview Hierarchical GRPO To Boost Reasoning Of Small-Sized Large Language Models

ACL ARR 2025 May Submission3351 Authors

19 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: Inference scaling enhances the reasoning capabilities of large language models, with reinforcement learning serving as the key technique for eliciting complex reasoning. However, key technical details of state-of-the-art reasoning LLMs, such as those in the OpenAI o series, the Claude 3 series, DeepMind's Gemini 2.5 series, and the Grok 3 series, remain undisclosed, making it difficult for the research community to replicate their reinforcement learning training results. We propose an Early Preview Hierarchical Reinforcement Learning algorithm built on the open-sourced Group Relative Policy Optimization (GRPO) framework. Specifically, we introduce an early preview version of a hierarchical reinforcement learning approach that continues to enhance the reasoning capabilities of small-sized large language models. In particular, a 1.5B-parameter LLM achieves 53.3% on AIME and 90.4% on Math500. These results, enabled by the proposed efficient early preview hierarchical reinforcement learning, demonstrate math reasoning capabilities comparable to o1-mini/o3-mini and are achievable within a typical school laboratory setting. In addition, we open-source both the dataset and model checkpoints to support future research in large-scale reinforcement learning for LLMs.
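The abstract does not spell out the GRPO objective the method builds on, so as a point of reference only, the sketch below illustrates (in plain Python, under our own assumptions rather than the authors' implementation) the group-relative advantage computation GRPO is known for: each prompt is answered with a group of sampled completions, and each completion's reward is normalized by the group's mean and standard deviation, which removes the need for a learned value function.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Illustrative group-relative advantage estimate in the GRPO style:
    normalize each sampled completion's reward by the mean and standard
    deviation of the rewards within its group (one group per prompt)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: one prompt, a group of 4 sampled completions scored 1 (correct) or 0
# by a rule-based math verifier; correct samples get positive advantages.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

In a full GRPO training loop these advantages would weight the token-level policy-gradient terms under a clipped ratio and KL penalty; how the proposed hierarchical layer modifies that loop is described in the paper itself, not in this sketch.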
Paper Type: Long
Research Area: Generation
Research Area Keywords: reasoning large language model, math reasoning, hierarchical reinforcement learning, reinforcement learning
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Position papers, Theory
Languages Studied: math reasoning in large language models
Keywords: Math Reasoning, Reinforcement Learning, Large Language Model, Hierarchical Reinforcement Learning
Submission Number: 3351