Interactive Post-Training for Vision-Language-Action Models

ICLR 2026 Conference Submission 2218 Authors

05 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Reinforcement Learning, VLA
Abstract: We introduce RIPT-VLA, a simple and scalable reinforcement-learning-based interactive post-training paradigm that fine-tunes pretrained Vision-Language-Action (VLA) models using only sparse binary success rewards. Existing VLA training pipelines rely heavily on offline expert demonstration data and supervised imitation, limiting their ability to adapt to new tasks and environments in low-data regimes. RIPT-VLA addresses this by enabling interactive post-training with a stable policy optimization algorithm based on dynamic rollout sampling and leave-one-out advantage estimation. Without requiring shaped rewards or value models, RIPT-VLA achieves state-of-the-art results across a wide range of tasks and benchmarks. It improves the lightweight QueST model by up to 21.2% in few-shot settings, achieving a state-of-the-art 94.3% on LIBERO-90, and pushes the large-scale OpenVLA-OFT model to 97.5% on the LIBERO 4-Suite benchmark. Remarkably, given only one demonstration, RIPT-VLA turns a failing SFT model (4% success rate) into one that succeeds 97% of the time within 15 iterations. These results highlight RIPT-VLA as a practical and effective paradigm for post-training VLA models with minimal supervision. Code and checkpoints will be released (anonymous code is included in the supplementary material for review).
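The abstract's leave-one-out advantage estimation can be sketched concretely. This is a minimal illustration, not the authors' implementation: given K rollouts of a task with binary success rewards, each rollout's baseline is the mean reward of the other K-1 rollouts, so no value model is needed.

```python
def loo_advantages(rewards):
    """Leave-one-out advantage for each of K rollouts.

    A_i = r_i - mean(r_j for j != i)
        = r_i - (sum(r) - r_i) / (K - 1)

    With sparse binary rewards, if all K rollouts share the same
    outcome, every advantage is zero and the sample is uninformative
    (motivating the dynamic rollout sampling mentioned in the abstract).
    """
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# Example: 2 successes out of 4 rollouts.
# Successes get positive advantage, failures negative.
advs = loo_advantages([1, 0, 0, 1])
```

A uniformly successful batch, e.g. `loo_advantages([1, 1, 1, 1])`, yields all-zero advantages, which is why contexts with identical outcomes provide no policy gradient signal.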
Primary Area: applications to robotics, autonomy, planning
Supplementary Material: zip
Submission Number: 2218