Self-Improving Vision-Language-Action Models with Data Generation via Residual RL

Published: 01 Mar 2026 · Last Modified: 01 Mar 2026 · TTU at ICLR 2026 (Main) · CC BY 4.0
Abstract: Supervised fine-tuning (SFT) has become the de facto post-training strategy for large vision-language-action (VLA) models, but its reliance on costly human demonstrations limits scalability and generalization. We propose Probe, Learn, Distill (**PLD**), a three-stage plug-and-play framework that improves VLAs through residual reinforcement learning (RL) and distribution-aware data collection: (1) train residual specialists to handle the generalist's failure cases, (2) collect diverse data via hybrid policy rollouts, and (3) distill the resulting trajectories back into the generalist via SFT. We evaluate **PLD** across diverse settings: it achieves a near-saturated 99% task success rate on the LIBERO benchmark, delivers over 50% performance gains in SimplerEnv, and reaches a 100% success rate on real-world Franka arm and YAM arm dexterous manipulation tasks. We further provide ablations showing that residual policy probing and distribution-aware replay are key to collecting deployment-aligned data that improves VLAs' capabilities on both seen and unseen tasks. Our results demonstrate that RL-generated, policy-aligned data can surpass teleoperation-only demonstrations, offering a scalable path toward self-improving VLA models. More results can be found on the [website](http://anonymous-pld.github.io/).
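The three stages described above can be summarized as a single improvement loop. Below is a minimal, hypothetical Python sketch of that loop; the callables `train_residual`, `hybrid_rollout`, and `sft_update`, the `rollouts_per_task` parameter, and the trajectory format are all illustrative assumptions, not the authors' implementation.

```python
from typing import Any, Callable, Dict, List


def probe_learn_distill(
    generalist: Any,
    tasks: List[Any],
    train_residual: Callable[[Any, Any], Any],        # (generalist, task) -> residual specialist (RL)
    hybrid_rollout: Callable[[Any, Any], Dict],       # (specialist, task) -> trajectory dict
    sft_update: Callable[[Any, List[Dict]], Any],     # (generalist, trajectories) -> updated generalist
    rollouts_per_task: int = 50,
) -> Any:
    """One PLD iteration: probe failure cases with residual RL, collect
    hybrid-policy rollouts, and distill successes back into the generalist."""
    distill_data: List[Dict] = []
    for task in tasks:
        # Stage 1 (Probe/Learn): train a residual specialist with RL on top of
        # the frozen generalist, focusing on the generalist's failure cases.
        specialist = train_residual(generalist, task)

        # Stage 2 (Collect): hybrid policy rollouts keep the collected data
        # close to the generalist's deployment distribution.
        for _ in range(rollouts_per_task):
            traj = hybrid_rollout(specialist, task)
            if traj.get("success"):
                distill_data.append(traj)

    # Stage 3 (Distill): supervised fine-tuning on the RL-generated,
    # policy-aligned trajectories.
    return sft_update(generalist, distill_data)
```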
Submission Number: 50