Hierarchical Decision-making via Multi-turn Reinforcement Learning
Keywords: LLM Reasoning, Deep Reinforcement Learning, Offline Reinforcement Learning, Multi-turn, Hierarchical Decision-making
Abstract: Large language models (LLMs) often struggle with tasks requiring deep reasoning and multi-turn decision-making. Reinforcement learning (RL) has thus emerged as a complementary approach, enabling LLM-based agents to optimize sequential decision-making while leveraging their broad knowledge. In particular, offline RL improves sample efficiency and supports generalization by exploiting pre-collected datasets. Although offline RL provides a practical framework for training LLM-based agents, it often struggles in sparse reward environments where long-term planning is critical. These limitations motivate hierarchical decision-making, inspired by dual-process theory. In this paradigm, System 1 generates high-level goals, while System 2 executes detailed actions to achieve them. This division of roles enables each system to focus on its specialized function, enhancing efficiency and performance in multi-turn tasks.
This work aims to develop $\texttt{Multi}^2$, a hierarchical LLM-based agent framework designed to handle multi-turn tasks (Figure 1). $\texttt{Multi}^2$ integrates supervised fine-tuning (SFT) for stable initialization with offline RL for robust sequential decision-making. We denote the agents as $\texttt{System 1}$ and $\texttt{System 2}$: $\texttt{System 1}$ receives the overall task as input and generates a sequence of high-level goals, while $\texttt{System 2}$ executes single-step actions aligned with each high-level goal. To enable efficient model updates, both agents share a common base model but employ distinct low-rank adaptation (LoRA) matrices.
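As a concrete illustration of this shared-base-model design, the minimal sketch below attaches two independent LoRA adapters, one per agent, to a single frozen base LLM and switches between them at generation time. It assumes a Hugging Face PEFT-style setup; the base model name, adapter names, prompts, and LoRA hyperparameters are illustrative placeholders rather than the exact $\texttt{Multi}^2$ configuration.

```python
# Minimal sketch: one shared base LLM with two LoRA adapters ("system1", "system2").
# Assumes the Hugging Face `transformers` and `peft` libraries; all names and
# hyperparameters are illustrative, not the exact Multi^2 configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base_name)
base_model = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16)

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # projections that receive low-rank updates
    task_type="CAUSAL_LM",
)

# Wrap the base model once, then register a second adapter on the same frozen weights.
model = get_peft_model(base_model, lora_cfg, adapter_name="system1")
model.add_adapter("system2", lora_cfg)

def generate_with(adapter: str, prompt: str, max_new_tokens: int = 64) -> str:
    """Activate one agent's adapter and decode a response from the shared base model."""
    model.set_adapter(adapter)
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# System 1 proposes a high-level goal; System 2 grounds it into a single-step action.
goal = generate_with("system1", "Task: grow a plant.\nNext high-level goal:")
action = generate_with("system2", f"Goal: {goal}\nNext low-level action:")
```

Because the base weights stay frozen and only the two small adapter sets are trained, the memory footprint remains close to that of a single fine-tuned model.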
Training proceeds in two phases. First, both agents are trained with SFT on a pre-collected dataset, which provides a reliable initialization and guides task-specific output formats. Second, we apply offline RL-based fine-tuning to ensure accurate decision-making in multi-turn tasks. $\texttt{System 1}$ retains its SFT-trained policy, as only sparse feedback is available. $\texttt{System 2}$ adopts an actor-critic architecture with implicit Q-learning (IQL) to stabilize training and mitigate out-of-distribution issues: the Q-function estimates the return of an action given the current observation, and the value function estimates the return of the observation itself.
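For reference, one standard formulation of IQL consistent with this description combines expectile regression for the value function, a temporal-difference loss for the Q-function, and advantage-weighted regression for the actor. The objectives below follow the original IQL formulation (with $o$ for observations, $a$ for actions, dataset $\mathcal{D}$, expectile $\tau$, and inverse temperature $\beta$) rather than any $\texttt{Multi}^2$-specific notation:

\[
\begin{aligned}
L_V(\psi) &= \mathbb{E}_{(o,a)\sim\mathcal{D}}\Big[L_2^{\tau}\big(Q_{\hat\theta}(o,a) - V_\psi(o)\big)\Big], \qquad L_2^{\tau}(u) = \big|\tau - \mathbb{1}(u < 0)\big|\,u^2,\\
L_Q(\theta) &= \mathbb{E}_{(o,a,r,o')\sim\mathcal{D}}\Big[\big(r + \gamma V_\psi(o') - Q_\theta(o,a)\big)^2\Big],\\
L_\pi(\phi) &= -\,\mathbb{E}_{(o,a)\sim\mathcal{D}}\Big[\exp\!\big(\beta\,(Q_{\hat\theta}(o,a) - V_\psi(o))\big)\,\log \pi_\phi(a\mid o)\Big],
\end{aligned}
\]

where $\hat\theta$ denotes target Q-network parameters. Because both critics are fit only to actions that appear in the offline dataset, the Q-function is never queried on out-of-distribution actions, which is the mechanism behind the stability noted above.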
We evaluate $\texttt{Multi}^2$ on the ScienceWorld benchmark [1] against the ReAct [2] and Reflexion [3] baselines. Table 1 reports model accuracy (Acc) and failures (Fail) across three topics, evaluated over 10 random seeds. The topics are (1) Mendelian Genetics of Biology, (2) Identification of Biology, and (3) Classification. Acc is measured per task, with 100% for full success and partial credit for achieving only the high-level goal. The proposed framework consistently outperforms both baselines, achieving more than a 40% relative improvement.
Submission Number: 110