Multi-Level Multi-Turn RL Outperforms GRPO: Reasoning with Textual Feedback

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: hierarchical reinforcement learning, LLM reasoning, self-correction in LLMs, multi-turn RL
TL;DR: MLMT-RL is a multi-level multi-turn approach that decomposes reasoning into higher-level feedback generation and lower-level response refinement, outperforming GRPO-based models on three benchmarks.
Abstract: Reinforcement learning with verifiable rewards has become the standard for training reasoning models, with Group Relative Policy Optimization (GRPO) achieving remarkable performance across mathematical, coding, and scientific domains. However, these approaches suffer from severe sample inefficiency: rewards are sparse and binary, so even partially correct responses receive zero reward, providing no learning signal and causing extremely slow convergence. We propose Multi-Level Multi-Turn Reinforcement Learning (MLMT-RL), a novel framework that addresses this limitation by leveraging textual feedback to provide dense, interpretable learning signals. MLMT-RL decomposes reasoning into two synergistic levels: a higher-level policy generates task-specific contextual feedback, while a lower-level policy produces refined responses conditioned on that feedback. To ensure effective coordination between guidance generation and execution, we formulate a principled bi-level optimization framework in which the higher-level policy is regularized by the lower-level value function. Additionally, we introduce new metrics to evaluate feedback quality and how effectively feedback is utilized. Our results demonstrate superior parameter efficiency: MLMT-RL with 2B parameters outperforms 3B GRPO models by 3.13% on MATH500, 5.18% on MBPP, and 4.77% on GPQA, and our 6B model surpasses 7B GRPO models by 3.0%, 2.8%, and 5.7% on the same benchmarks. MLMT-RL thus establishes a highly efficient paradigm that delivers superior reasoning performance with significantly fewer parameters.
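The abstract describes the decomposition only at a high level; the sketch below illustrates how such a two-level, multi-turn rollout might be wired together. All names and interfaces here (`multi_level_rollout`, `high_policy`, `low_policy`, the `[0, 1]` verifier score, the three-turn budget) are hypothetical assumptions for illustration, not the paper's actual algorithm, and the training-time bi-level regularization of the higher-level policy is omitted.

```python
"""Minimal, runnable sketch of the two-level loop described in the
abstract. All interfaces and stand-ins below are assumptions, not the
paper's MLMT-RL implementation (which additionally regularizes the
higher-level policy with the lower-level value function during
training; that step is omitted here)."""
from dataclasses import dataclass


@dataclass
class Turn:
    feedback: str   # guidance emitted by the higher-level policy
    response: str   # refined answer from the lower-level policy
    reward: float   # per-turn score; denser than one binary end reward


def multi_level_rollout(task, high_policy, low_policy, verifier, max_turns=3):
    """One multi-turn episode: feedback -> refined response -> verify.

    high_policy(task, history) -> textual feedback given prior attempts
    low_policy(task, feedback) -> a response conditioned on the feedback
    verifier(task, response)   -> score in [0, 1] (assumed dense signal)
    """
    history = []
    for _ in range(max_turns):
        feedback = high_policy(task, history)
        response = low_policy(task, feedback)
        reward = verifier(task, response)
        history.append(Turn(feedback, response, reward))
        if reward >= 1.0:  # fully correct answer: stop refining early
            break
    return history


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    high = lambda task, hist: "check your arithmetic" if hist else "show your steps"
    low = lambda task, fb: f"attempt at {task!r} using hint {fb!r}"
    verify = lambda task, resp: 1.0 if "check" in resp else 0.4

    for turn in multi_level_rollout("2 + 2 * 3", high, low, verify):
        print(turn)
```

The design point the abstract emphasizes is visible in the loop: each turn yields both a textual signal and a score, so a partially correct attempt still shapes the next turn rather than receiving a flat zero reward.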
Primary Area: reinforcement learning
Submission Number: 14883