Progressive Stage-Aware Reinforcement for Fine-Tuning Vision-Language-Action Models

ICLR 2026 Conference Submission 17844 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Vision-Language-Action Models, Robotic Manipulation, Fine-Tuning, Preference Optimization, Reinforcement Learning, Stage-Aware Optimization
TL;DR: We propose Stage-Aware Optimization, which improves VLA fine-tuning by decomposing manipulation into stages for precise offline preference alignment and stage-conditioned online policy refinement.
Abstract: Recent advances in Vision-Language-Action (VLA) models, powered by large language models and reinforcement learning-based fine-tuning, have shown remarkable progress in robotic manipulation. Existing methods often treat long-horizon actions as linguistic sequences and apply trajectory-level optimization methods such as Trajectory-wise Preference Optimization (TPO) or Proximal Policy Optimization (PPO), leading to coarse credit assignment and unstable training. However, unlike language, where a unified semantic meaning is preserved despite flexible sentence order, action trajectories progress through causally chained stages with different learning difficulties. This motivates progressive stage optimization. We therefore present Stage-Aware Reinforcement (STARE), a module that decomposes a long-horizon action trajectory into semantically meaningful stages and provides dense, interpretable, and stage-aligned reinforcement signals. Integrating STARE into TPO and PPO yields Stage-Aware TPO (STA-TPO) and Stage-Aware PPO (STA-PPO) for offline stage-wise preference learning and online intra-stage interaction, respectively. Further building on supervised fine-tuning as initialization, we propose Imitation→Preference→Interaction (IPI), a serial fine-tuning pipeline for improving action accuracy in VLA models. Experiments on SimplerEnv and ManiSkill3 demonstrate substantial gains, achieving state-of-the-art success rates of 98.0% on SimplerEnv and 96.4% on ManiSkill3 tasks. Our code will be released publicly.
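The core idea in the abstract — replacing one trajectory-level signal with per-stage credit assignment — can be sketched minimally as follows. This is an illustrative reconstruction, not the authors' STARE implementation: the stage boundaries, the per-stage rewards, and the function names (`decompose`, `stage_returns`) are all assumptions for exposition.

```python
# Hypothetical sketch of stage-aware credit assignment. A trajectory is split
# at assumed stage boundaries, and each step's return is computed within its
# own stage instead of from a single trajectory-level outcome.

def decompose(trajectory, boundaries):
    """Split a trajectory (list of steps) into stages at the given indices."""
    stages, start = [], 0
    for b in boundaries:
        stages.append(trajectory[start:b])
        start = b
    stages.append(trajectory[start:])
    return stages

def stage_returns(stages, stage_rewards, gamma=0.99):
    """Assign each step a discounted return from its own stage's completion
    reward, giving dense, stage-aligned signals rather than one coarse
    trajectory-level return."""
    returns = []
    for steps, r in zip(stages, stage_rewards):
        g, rev = r, []          # reward r is granted at stage completion
        for _ in steps:
            rev.append(g)       # last step of the stage receives r, ...
            g *= gamma          # ... earlier steps geometrically less
        returns.extend(reversed(rev))
    return returns

# Toy usage: a 6-step trajectory split into three 2-step stages.
traj = list(range(6))
stages = decompose(traj, boundaries=[2, 4])
rets = stage_returns(stages, stage_rewards=[1.0, 1.0, 1.0], gamma=0.5)
```

Under these assumptions, the step ending a stage receives that stage's full reward, and steps within a stage are discounted only up to the stage boundary, so early stages are no longer drowned out by a distant terminal outcome.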
Primary Area: applications to robotics, autonomy, planning
Submission Number: 17844