Openhelix: Empirical Analysis of Dual-System VLA Models for Robotic Manipulation

ICLR 2026 Conference Submission 13392 Authors

18 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: VLA; Dual system; Dynamic Scenarios
Abstract: Dual-system vision-language-action (VLA) architectures are emerging as a promising approach in embodied intelligence. However, existing work lacks consistent training and evaluation protocols across high- and low-level modules, making systematic comparison and rigorous analysis challenging. In this work, we conduct a comprehensive study of core design principles in existing dual-system VLA architectures and introduce DSVLABench, a new suite that covers diverse evaluation scenarios and standardizes the assessment pipeline for various architectures. Our results show that prompt tuning preserves multimodal large language model generalization, fine-tuning from pre-trained policies outperforms training from scratch in policy learning, and pre-aligning projectors with auxiliary dynamic visual tasks significantly enhances latent space training. Additionally, we find that the frequency of high-level updates has minimal impact during asynchronous inference, with latent embeddings remaining robust to dynamic changes. We hope our findings provide practical guidelines for developing more generalizable and robust dual-system VLA models.
Primary Area: applications to robotics, autonomy, planning
Submission Number: 13392