Keywords: Reinforcement Fine-Tuning, Embodied Planning, Vision Language Models
Abstract: Embodied planning requires agents to make coherent multi-step decisions based on dynamic visual observations and natural language goals.
While recent vision-language models (VLMs) excel at static perception tasks, they struggle in interactive environments.
In this work, we introduce an on-policy reinforcement fine-tuning (RFT) framework with offline rewards that preserves the generalization benefits of RFT while addressing the challenges of sparse rewards and costly interaction, supported by theoretical guarantees.
Our approach is evaluated on Embench, a recent benchmark for interactive embodied tasks, covering both in-domain and out-of-domain scenarios.
Experimental results show that our method significantly outperforms models of similar or larger scale, including GPT-4o-mini and 70B+ open-source baselines, and exhibits strong generalization to unseen environments.
This work highlights the potential of reinforcement-driven reasoning to advance multi-step planning in embodied AI.
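The abstract describes on-policy reinforcement fine-tuning in which sampled actions are scored by offline rewards rather than live environment rollouts. The snippet below is a minimal illustrative sketch of that general idea, not the paper's actual algorithm: the toy policy, the `offline_reward` scorer, and all hyperparameters are hypothetical stand-ins.

```python
# Hypothetical sketch: on-policy policy-gradient fine-tuning where sampled
# actions are scored by an offline reward function instead of live rollouts.
# All names (ToyPolicy, offline_reward) are illustrative, not from the paper.
import torch
import torch.nn as nn

NUM_ACTIONS = 8

class ToyPolicy(nn.Module):
    """Stand-in for a VLM planner: maps a state vector to action logits."""
    def __init__(self, state_dim=16, num_actions=NUM_ACTIONS):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, state):
        return self.net(state)

def offline_reward(actions):
    """Placeholder offline scorer: rewards a fixed 'expert' action.
    In practice this signal would come from pre-collected trajectories or a
    learned reward model, avoiding costly online interaction."""
    return (actions == 3).float()

policy = ToyPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for step in range(200):
    states = torch.randn(32, 16)               # batch of toy observations
    logits = policy(states)
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()                     # on-policy sampling
    rewards = offline_reward(actions)           # scored offline, no env step
    baseline = rewards.mean()                   # simple variance-reduction baseline
    loss = -((rewards - baseline) * dist.log_prob(actions)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The key point of the sketch is that the sampling loop remains on-policy (actions are drawn from the current policy), while the reward signal is computed without interacting with the environment.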
Primary Area: applications to robotics, autonomy, planning
Submission Number: 3246