DySL-VLA: Efficient Vision-Language-Action Model Inference via Dynamic-Static Layer-Skipping for Robot Manipulation
Keywords: Vision-Language-Action Model, Layer Skipping, Robot Manipulation
TL;DR: We propose DySL-VLA, which accelerates vision-language-action models by dynamically allocating computation across action predictions of differing importance
Abstract: Vision-Language-Action (VLA) models have shown remarkable success in robotic tasks like manipulation by fusing a language model's reasoning with a vision model's 3D understanding. However, their high computational cost remains a major obstacle for real-world applications that require real-time performance.
We observe that the actions within a task have varying levels of importance: critical steps demand high precision, while less important ones can tolerate more variance. Leveraging this insight, we propose DySL-VLA, a novel framework that reduces computational cost by dynamically skipping VLA layers based on each action's importance. DySL-VLA categorizes its layers into two types: informative layers, which are always executed, and incremental layers, which can be selectively skipped. To skip layers intelligently without sacrificing accuracy, we design a prior-post skipping guidance mechanism that determines when to initiate layer skipping.
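For intuition, the following is a minimal PyTorch sketch of the informative/incremental layer split with a lightweight gate deciding whether to run the incremental layers. The class, gate, and threshold names are illustrative assumptions rather than the paper's exact mechanism (in particular, only the "prior" half of the prior-post guidance is sketched here).

```python
# Hedged sketch of dynamic layer skipping: early "informative" layers always
# run; later "incremental" layers are skipped when a gate predicts that the
# current action step tolerates less computation. Names are hypothetical.
import torch
import torch.nn as nn


class DySLDecoderSketch(nn.Module):
    def __init__(self, layers: nn.ModuleList, num_informative: int, hidden_dim: int):
        super().__init__()
        self.layers = layers
        self.num_informative = num_informative
        # Hypothetical "prior" gate: scores, from the current hidden state,
        # whether the remaining incremental layers can be skipped.
        self.prior_gate = nn.Linear(hidden_dim, 1)

    def forward(self, hidden: torch.Tensor, skip_threshold: float = 0.5):
        # Informative layers: always executed.
        for layer in self.layers[: self.num_informative]:
            hidden = layer(hidden)

        # Prior guidance: decide whether this action step needs more depth.
        skip_score = torch.sigmoid(self.prior_gate(hidden.mean(dim=1))).mean()
        if skip_score > skip_threshold:
            return hidden  # incremental layers skipped for this step

        # Incremental layers: executed only for "important" action steps.
        for layer in self.layers[self.num_informative:]:
            hidden = layer(hidden)
        return hidden


# Example usage with generic transformer blocks standing in for VLA layers.
blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True) for _ in range(8)
)
model = DySLDecoderSketch(blocks, num_informative=4, hidden_dim=256)
out = model(torch.randn(2, 10, 256))
```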
We also propose a skip-aware two-stage knowledge distillation algorithm to efficiently train a standard VLA into a DySL-VLA. Our comprehensive experiments show that DySL-VLA surpasses the state of the art, achieving a 2.1\% improvement in success length over Deer-VLA (NeurIPS'24) on the CALVIN benchmark, while reducing trainable parameters by a factor of 85.7 and delivering a 3.75$\times$ speedup over the RoboFlamingo baseline at iso-accuracy. Our code is available in an anonymous GitHub repository.
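As a rough illustration of what a two-stage distillation loop could look like, the sketch below assumes that stage one matches the student's full-depth action predictions to a frozen teacher, and that stage two trains with skipping enabled under a sparsity-style penalty. The function names, loss choices, and the student's `force_full_depth` / `return_skip_rate` arguments are hypothetical; the abstract only states that a standard VLA is distilled into a DySL-VLA in two skip-aware stages.

```python
# Hedged sketch of a two-stage, skip-aware distillation loop (all interface
# details are assumptions for illustration only).
import torch
import torch.nn.functional as F


def stage_one_step(teacher, student, batch, optimizer):
    # Stage 1 (assumed): align the student's full-depth action predictions
    # with the frozen teacher's, so the shared layers inherit its behavior.
    with torch.no_grad():
        teacher_actions = teacher(batch)
    student_actions = student(batch, force_full_depth=True)  # hypothetical flag
    loss = F.mse_loss(student_actions, teacher_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


def stage_two_step(teacher, student, batch, optimizer, sparsity_weight=0.1):
    # Stage 2 (assumed): enable layer skipping and trade off fidelity to the
    # teacher against how often the incremental layers are executed.
    with torch.no_grad():
        teacher_actions = teacher(batch)
    student_actions, skip_rate = student(batch, return_skip_rate=True)  # hypothetical flag
    loss = F.mse_loss(student_actions, teacher_actions) + sparsity_weight * (1.0 - skip_rate)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```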
Primary Area: applications to robotics, autonomy, planning
Submission Number: 9019