Keywords: LLM Reasoning, Deep Reinforcement Learning, Offline Reinforcement Learning, Multi-turn, Hierarchical Decision-making
Abstract: Intelligent agents are meant to support complex, long-horizon decision-making, yet current large language models (LLMs) struggle with multi-turn interactions. We introduce $\texttt{Multi}^2$, a hierarchical decision-making framework that operationalizes dual-process theory by separating System 1, a planner that generates long-term goals using supervised fine-tuning (SFT), from System 2, an executor that learns sequential decision-making via offline reinforcement learning (RL). Both systems share the same LLM backbone but specialize through distinct low-rank adaptation (LoRA) modules. This design enables sample-efficient training without online interaction and robust multi-turn reasoning through hierarchical decomposition. Experiments show that $\texttt{Multi}^2$ achieves $17.5\%$ higher performance and a $14.2\%$ higher success rate than the strongest baseline. These results highlight the novelty of $\texttt{Multi}^2$ as the first framework to combine multi-agent LLMs with offline RL, providing a principled path toward scalable, multi-turn intelligent agents.
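The abstract describes a shared LLM backbone specialized through two separate LoRA modules, one for the SFT planner and one for the offline-RL executor. The sketch below is a minimal, hypothetical illustration of that shared-backbone, dual-adapter setup using Hugging Face PEFT; the model name, adapter names, and LoRA hyperparameters are assumptions, not details from the paper.

```python
# Minimal sketch (assumed, not from the paper): one shared backbone with two
# LoRA adapters, one for the planner (trained with SFT) and one for the
# executor (trained with offline RL). All names here are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical backbone; the paper's actual base model is not specified here.
backbone = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Attach the planner adapter to the shared backbone.
agent = get_peft_model(backbone, lora_cfg, adapter_name="planner")
# Attach a second, independent adapter for the executor.
agent.add_adapter("executor", lora_cfg)

# At inference, switch adapters depending on which role is acting.
agent.set_adapter("planner")   # generate a long-term goal
# ... planner forward pass ...
agent.set_adapter("executor")  # take low-level actions toward that goal
# ... executor forward pass ...
```

Because only the adapter weights differ, the two roles add little memory overhead beyond the single backbone, which is consistent with the sample-efficiency and scalability claims in the abstract.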
Submission Number: 250