Keywords: Large Language Models (LLMs), Planning, reinforcement learning, interactive, embodied
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, yet they struggle with agent task planning in dynamic environments that require continuous observation and sequential decision-making. Current methods generate static action sequences from pre-trained knowledge without learning from environmental feedback, limiting their effectiveness in partially observable settings. We present Interactive Planner-R1, a novel trajectory-level reinforcement learning framework that enables LLMs to develop interactive planning capabilities through autonomous environmental exploration. Our approach addresses three key challenges: (1) limited exploration diversity, via multi-trajectory autonomous exploration through parallel group rollouts; (2) sparse reward signals, via a completion-driven reward architecture that promotes genuine environmental understanding; and (3) single-step optimization constraints, via Interactive Policy Optimization (IPO), which extends group-relative policy optimization to multi-step trajectory learning. Extensive experiments on ALFWorld and ScienceWorld demonstrate that Interactive Planner-R1 achieves substantial improvements over existing approaches, reaching a 97.55\% completion rate on ALFWorld and 79.92\% on ScienceWorld, and generalizes strongly, with only a 3.33\% performance gap in unseen environments. Our work establishes a new paradigm for LLM-based interactive planning through trajectory-level policy learning.
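As a rough illustration of the group-relative idea behind IPO, the sketch below normalizes each trajectory's return against its parallel rollout group (the standard GRPO-style advantage estimate, here applied at the trajectory level). This is a hypothetical minimal example; the function name, reward values, and omission of the clipped policy-gradient objective are all assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: group-relative advantage estimation over trajectories.
# Each of G parallel rollouts of the same task yields one scalar return
# (e.g., a completion-driven reward); advantages are z-scores within the group.
from statistics import mean, pstdev

def group_relative_advantages(group_returns, eps=1e-8):
    """Normalize each trajectory's return against its rollout group."""
    mu = mean(group_returns)        # group baseline
    sigma = pstdev(group_returns)   # group spread
    return [(r - mu) / (sigma + eps) for r in group_returns]

# Example: 4 parallel rollouts; successful trajectories score above the mean
returns = [1.0, 0.0, 1.0, 0.2]
advantages = group_relative_advantages(returns)
# Trajectories that beat the group average get positive advantage,
# so their actions are reinforced across every step of the trajectory.
```

In a multi-step setting, the same trajectory-level advantage would typically be broadcast to every action token in that trajectory when computing the policy-gradient update.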
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Large Language Models (LLMs), Planning, reinforcement learning, interactive, embodied
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 6220