Keywords: Reinforcement Fine-Tuning, Embodied Planning, Vision Language Models
Abstract: Embodied planning requires agents to make coherent multi-step decisions based on dynamic visual observations and natural language goals.
While recent vision-language models (VLMs) excel at static perception tasks, they struggle in interactive environments.
In this work, we introduce an on-policy reinforcement fine-tuning (RFT) framework with offline rewards that preserves the generalization benefits of RFT while addressing the challenges of sparse rewards and costly interaction, supported by theoretical guarantees.
Our approach is evaluated on Embench, a recent benchmark for interactive embodied tasks, covering both in-domain and out-of-domain scenarios.
Experimental results show that our method significantly outperforms models of similar or larger scale, including GPT-4o-mini and 70B+ open-source baselines, and exhibits strong generalization to unseen environments.
This work highlights the potential of reinforcement-driven reasoning to advance multi-step planning in embodied AI.
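The abstract describes on-policy reinforcement fine-tuning in which sampled actions are scored by offline rewards rather than live environment rollouts. The snippet below is a minimal illustrative sketch of that general idea, not the paper's actual algorithm: the toy policy, the `offline_reward` scorer, and all hyperparameters are hypothetical stand-ins.

```python
# Hypothetical sketch: on-policy policy-gradient fine-tuning where sampled
# actions are scored by an offline reward function instead of live rollouts.
# All names (ToyPolicy, offline_reward) are illustrative, not from the paper.
import torch
import torch.nn as nn

NUM_ACTIONS = 8

class ToyPolicy(nn.Module):
    """Stand-in for a VLM planner: maps a state vector to action logits."""
    def __init__(self, state_dim=16, num_actions=NUM_ACTIONS):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, state):
        return self.net(state)

def offline_reward(actions):
    """Placeholder offline scorer: rewards a fixed 'expert' action.
    In practice this signal would come from pre-collected trajectories or a
    learned reward model, avoiding costly online interaction."""
    return (actions == 3).float()

policy = ToyPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for step in range(200):
    states = torch.randn(32, 16)               # batch of toy observations
    logits = policy(states)
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()                     # on-policy sampling
    rewards = offline_reward(actions)           # scored offline, no env step
    baseline = rewards.mean()                   # simple variance-reduction baseline
    loss = -((rewards - baseline) * dist.log_prob(actions)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The key point of the sketch is that the sampling loop remains on-policy (actions are drawn from the current policy), while the reward signal is computed without interacting with the environment.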
Primary Area: applications to robotics, autonomy, planning
Submission Number: 3246