Keywords: VLA Models, Reinforcement Learning, Bimanual Manipulation, Robot Learning
TL;DR: SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
Abstract: Vision-Language-Action (VLA) models have emerged as a powerful paradigm for robotic manipulation. Despite substantial progress enabled by large-scale pretraining and supervised fine-tuning (SFT), these models face two fundamental challenges:
(i) the scarcity and high cost of large-scale robotic trajectories required for SFT scaling,
and (ii) limited generalization to tasks under distribution shift.
To overcome these limitations, we explore reinforcement learning (RL) as a pathway to scaling VLA training beyond limited datasets.
Inspired by LLM breakthroughs where RL with outcome rewards enhances step-by-step reasoning, we ask: can outcome-driven RL improve the long-horizon, step-by-step action planning of VLA models?
In this work, we introduce SimpleVLA-RL, an efficient RL framework tailored for VLA models.
Building on veRL, we add VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation.
Applied to OpenVLA-OFT, SimpleVLA-RL achieves 99\% of SoTA performance on LIBERO and an 80\% relative
improvement on RoboTwin 1.0\&2.0, outperforming $\pi_0$ with our proposed exploration-enhancing strategies.
SimpleVLA-RL reduces dependence on large-scale data, enables robust generalization, and markedly surpasses SFT on real-world tasks.
Moreover, we identify a novel phenomenon, ``pushcut'', during RL training, wherein the policy discovers novel patterns beyond those seen in prior training.
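For a concrete picture of what outcome-driven RL for an action policy can look like, the toy sketch below runs a REINFORCE-style update with a binary success reward and a group-mean baseline over sampled trajectories. It is purely illustrative: `ToyPolicy`, `rollout`, the random observations, and the random success signal are hypothetical placeholders, not the SimpleVLA-RL, veRL, or OpenVLA-OFT APIs.

```python
# Hypothetical, minimal sketch of outcome-reward RL for an action policy.
# All names here are illustrative placeholders, not the paper's implementation.
import torch
import torch.nn as nn

class ToyPolicy(nn.Module):
    """Stand-in for a VLA policy: maps an observation to discrete action logits."""
    def __init__(self, obs_dim=16, n_actions=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs):
        return self.net(obs)

def rollout(policy, horizon=10, obs_dim=16):
    """Sample one trajectory; return its summed log-prob and a binary outcome reward."""
    logps = []
    obs = torch.randn(obs_dim)            # placeholder observation
    for _ in range(horizon):
        dist = torch.distributions.Categorical(logits=policy(obs))
        action = dist.sample()
        logps.append(dist.log_prob(action))
        obs = torch.randn(obs_dim)        # placeholder environment transition
    reward = float(torch.rand(()) < 0.5)  # placeholder sparse success signal (0 or 1)
    return torch.stack(logps).sum(), reward

policy = ToyPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
for step in range(100):
    # Sample a group of trajectories, compute advantages from outcome rewards
    # with a group-mean baseline, then take a policy-gradient step.
    logps, rewards = zip(*(rollout(policy) for _ in range(8)))
    rewards = torch.tensor(rewards)
    adv = rewards - rewards.mean()
    loss = -(torch.stack(logps) * adv).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```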
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 16445