On-policy Reinforcement Fine-tuning with Offline Reward for Multi-step Embodied Planning

ACL ARR 2026 January Submission 836 Authors

25 Dec 2025 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Reinforcement Fine-Tuning, Embodied Planning, Vision Language Models
Abstract: Embodied planning requires agents to make coherent multi-step decisions based on dynamic visual observations and verbal goals. While recent vision-language models (VLMs) excel at static perception tasks, they struggle in interactive environments. Reinforcement learning (RL) offers a natural way to address this limitation, yet online RL approaches suffer from costly interaction and sparse rewards in embodied settings. This paper introduces an on-policy reinforcement fine-tuning (RFT) framework with offline rewards that preserves the generalization benefits of RFT while sidestepping the costly interaction and sparse rewards of online RL, supported by theoretical guarantees (a minimal illustrative sketch follows the metadata below). Our approach is evaluated on EmbodiedBench, a recent benchmark for interactive embodied tasks, covering both in-domain and out-of-domain scenarios. Experimental results show that our approach achieves state-of-the-art performance, outperforming all closed-source and online-RL-based methods, while being substantially more efficient in training speed and computational cost, remaining robust to sub-optimal expert trajectories, and exhibiting strong generalization to unseen environments.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, embodied task planning, VLM
Languages Studied: English
Submission Number: 836
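
The abstract's central idea — sample actions on-policy from the current model, but score them with a reward computed offline rather than through live environment interaction — can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the toy policy, the agreement-with-expert reward, and all names are assumptions, and the update shown is a plain REINFORCE step standing in for whatever RFT objective the paper actually uses.

```python
# Minimal sketch (not the paper's method): on-policy sampling with an offline reward.
# The reward here is a toy "agreement with a logged expert action" signal that
# replaces environment rollouts -- a simplifying assumption for illustration.
import torch
import torch.nn as nn


class TinyPolicy(nn.Module):
    """Toy stand-in for a VLM planner: maps a state feature to action logits."""

    def __init__(self, state_dim: int = 8, n_actions: int = 4):
        super().__init__()
        self.net = nn.Linear(state_dim, n_actions)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


def offline_reward(actions: torch.Tensor, expert_actions: torch.Tensor) -> torch.Tensor:
    # Offline reward: no environment interaction; score sampled actions
    # against actions from a pre-collected expert trajectory.
    return (actions == expert_actions).float()


policy = TinyPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# One fine-tuning step over a batch of logged states and expert actions.
states = torch.randn(32, 8)
expert_actions = torch.randint(0, 4, (32,))

logits = policy(states)
dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()                    # on-policy: actions come from the current policy
rewards = offline_reward(actions, expert_actions)
baseline = rewards.mean()                  # simple batch baseline to reduce variance
loss = -((rewards - baseline) * dist.log_prob(actions)).mean()  # REINFORCE objective

opt.zero_grad()
loss.backward()
opt.step()
```

Because the reward is computed from logged data, each update costs one forward/backward pass with no simulator steps, which is the efficiency argument the abstract makes against online RL.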