RFTF: Reinforcement Fine-tuning for Vision-language-action Models with Temporal Feedback

ICLR 2026 Conference Withdrawn Submission · 19 Sept 2025 (modified: 28 Nov 2025) · CC BY 4.0
Keywords: Embodied Intelligence, Vision-Language-Action Model, Reinforcement Fine-tuning
TL;DR: Train a value model on temporal information to provide dense rewards for the reinforcement fine-tuning of VLA models.
Abstract: Vision-Language-Action (VLA) models have demonstrated significant potential in embodied intelligence, enabling models to follow human instructions and complete complex tasks in physical environments. Existing VLAs are typically trained through behavior cloning, which requires expensive data and computational resources and is constrained by the available human demonstrations. To address this issue, many researchers have explored applying reinforcement fine-tuning to VLAs. However, typical reinforcement fine-tuning methods for VLAs rely on sparse, outcome-based rewards, which struggle to provide fine-grained feedback for specific actions within an episode, limiting the models' manipulation and generalization performance. In this paper, we propose RFTF, a novel reinforcement fine-tuning method that leverages a value model to generate dense rewards in embodied scenarios. Specifically, our value model is trained using temporal information, eliminating the need for costly robot action labels. In addition, RFTF incorporates a range of techniques, such as Generalized Advantage Estimation (GAE) and sample balancing, to enhance the effectiveness of the fine-tuning process. By addressing the sparse-reward problem in reinforcement fine-tuning, our method significantly improves the performance of VLAs, delivering superior generalization and adaptation capabilities across diverse embodied tasks. Experimental results show that VLAs fine-tuned with RFTF achieve new state-of-the-art performance on the challenging CALVIN ABC-D benchmark, with an average success length of $4.296$. Moreover, RFTF enables rapid adaptation to new environments: after fine-tuning for a few episodes in the D environment of CALVIN, it achieves an average success length of $4.301$ in this new environment.
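To make the core idea concrete, below is a minimal, hypothetical sketch of how a temporally trained value model could supply dense per-step rewards that are then combined with GAE. The function names, the reward-as-progress-difference formulation, and the toy numbers are illustrative assumptions for exposition, not the paper's exact method.

```python
import numpy as np

def dense_rewards_from_values(values):
    # ASSUMPTION: shape each step's reward as the increase in the value
    # model's predicted task progress between consecutive observations.
    # The paper's exact dense-reward formulation may differ.
    values = np.asarray(values, dtype=np.float64)
    return values[1:] - values[:-1]

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    # Standard Generalized Advantage Estimation over one episode.
    # `values` has length T+1 (includes the value of the final
    # observation for bootstrapping); `rewards` has length T.
    T = len(rewards)
    advantages = np.zeros(T)
    last_adv = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    return advantages

if __name__ == "__main__":
    # Toy episode: a hypothetical value model scores each frame's
    # progress toward task completion in [0, 1].
    frame_progress = [0.05, 0.10, 0.30, 0.55, 0.80, 0.95]
    rewards = dense_rewards_from_values(frame_progress)
    adv = gae_advantages(rewards, frame_progress)
    print("dense rewards:", rewards)
    print("GAE advantages:", adv)
```

In this sketch the value model acts purely as a progress estimator over observations, so no robot action labels are needed to train it, and the resulting per-step rewards give the policy gradient finer-grained credit assignment than a single episode-level success signal.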
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 15498