Keywords: Large language model, World Model, Reinforcement learning
Abstract: Reinforcement learning (RL) agents rely on experiential data obtained through environmental interaction to achieve task objectives, but the efficacy of this learning paradigm remains fundamentally bounded by the high cost of such interaction. Recent research advocates integrating large language models (LLMs) to enhance decision making. However, previous methods predominantly require domain-specific fine-tuning of foundation models, incurring additional computational cost while offering limited generalization; others rely on control primitives and often produce rigid plans that fail to adapt to environmental dynamics. To this end, we introduce LWM-DPO: LLM-Based World Model with Distilled Policy Optimization in Task Planning, a few-shot framework that combines LLMs' autoregressive reasoning with lightweight policy distillation. Our method decouples state and action representations by separating planning from execution: the LLM is used only to plan state trajectories, and programmatic feedback enables it to generate consistent trajectories. Experiments on challenging robotics tasks demonstrate superior sample efficiency (an 11.3× improvement over baseline RL after 10k steps) and task success rates (91.2\% vs. 8.1\% in dynamic environments).
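The abstract describes a split in which the LLM plans a trajectory of states, refined via programmatic feedback, while a lightweight distilled policy handles execution. Below is a minimal sketch of that separation, assuming a state-only planner that is re-prompted when a transition fails a programmatic check and a small executor mapping (current state, next planned state) to an action; all names (`plan_state_trajectory`, `DistilledPolicy`, the checker interface) are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the planning/execution split described in the abstract.
# Interfaces and names are assumptions for illustration, not the paper's code.
from dataclasses import dataclass
from typing import Callable, List, Sequence

State = Sequence[float]   # e.g. proprioceptive + object state vector
Action = Sequence[float]  # e.g. joint-velocity or end-effector commands


@dataclass
class PlanStep:
    state: State          # a waypoint state proposed by the LLM planner
    feedback: str = ""    # programmatic feedback if the step violates a check


def plan_state_trajectory(
    llm_propose: Callable[[State, str], List[State]],
    check: Callable[[State, State], str],
    start: State,
    max_retries: int = 3,
) -> List[PlanStep]:
    """Ask the LLM planner for a state-only trajectory; re-prompt with
    programmatic feedback until every transition passes the checker."""
    feedback = ""
    steps: List[PlanStep] = []
    for _ in range(max_retries):
        states = llm_propose(start, feedback)   # LLM plans states, never actions
        steps, feedback = [], ""
        prev = start
        for s in states:
            msg = check(prev, s)                # e.g. reachability / collision check
            steps.append(PlanStep(state=s, feedback=msg))
            if msg:                             # collect feedback and re-plan
                feedback = msg
                break
            prev = s
        if not feedback:
            return steps                        # consistent trajectory found
    return steps                                # best effort after retries


class DistilledPolicy:
    """Lightweight executor: maps (current state, next planned state) -> action.
    A stub here; in practice a small network distilled from demonstrations
    or RL rollouts would fill this role."""

    def act(self, state: State, goal_state: State) -> Action:
        # Naive placeholder: step proportionally toward the planned next state.
        return [g - s for s, g in zip(state, goal_state)]
```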
Submission Number: 17