Keywords: Reinforcement Learning, VLA
TL;DR: We present RIPT-VLA, a reinforcement interactive post-training paradigm for Vision-Language-Action (VLA) models.
Abstract: We introduce RIPT-VLA, a simple and scalable reinforcement-learning-based interactive post-training paradigm that fine-tunes pretrained Vision-Language-Action (VLA) models using only sparse binary success rewards. Existing VLA training pipelines rely heavily on offline expert demonstration data and supervised imitation, limiting their ability to adapt to new tasks and environments in low-data regimes. RIPT-VLA addresses this by enabling interactive post-training with a stable policy optimization algorithm based on dynamic rollout sampling and leave-one-out advantage estimation. Without requiring shaped rewards or value models, RIPT-VLA achieves state-of-the-art results across a wide range of tasks and benchmarks. It improves the lightweight QueST model by up to 21.2% in few-shot settings, reaching a state-of-the-art 94.3% on LIBERO-90, and pushes the large-scale OpenVLA-OFT model to 97.6% on the LIBERO 4-Suite benchmark. Remarkably, given only a single demonstration, RIPT-VLA raises a nearly unusable SFT model from a 4% success rate to 97% within 15 iterations. These results highlight RIPT-VLA as a practical and effective paradigm for post-training VLA models with minimal supervision. Code and checkpoints will be released.
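The abstract's leave-one-out advantage estimation over sparse binary success rewards can be sketched as follows. This is a minimal illustration of the general leave-one-out (RLOO-style) baseline, not the authors' released implementation; the function name and the group size are illustrative assumptions.

```python
import numpy as np

def leave_one_out_advantages(rewards):
    """For K rollouts of the same task, the advantage of rollout i is its
    reward minus the mean reward of the OTHER K-1 rollouts (a leave-one-out
    baseline), so no learned value model is needed."""
    r = np.asarray(rewards, dtype=float)
    k = len(r)
    # Baseline for rollout i: (sum of all rewards - r_i) / (k - 1)
    return r - (r.sum() - r) / (k - 1)

# Sparse binary success rewards for K=4 sampled rollouts of one task:
adv = leave_one_out_advantages([1, 0, 0, 1])
print(adv)  # successes get positive advantage, failures negative
```

Note that if all rollouts in a group share the same reward (all succeed or all fail), every advantage is zero and the group yields no gradient signal; discarding such groups is one plausible reading of the "dynamic rollout sampling" the abstract mentions.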
Published paper: N/A
Submission Number: 23