Keywords: VLA model acceleration; robotics
TL;DR: Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning
Abstract: Pruning is a common technique for accelerating compute-bound models by removing computation on unimportant values. Recently, it has been applied to accelerate Vision-Language-Action (VLA) model inference. However, existing methods focus on local information from the current action step and ignore the global context, leading to a $>20$% success rate drop and limited speedup in some scenarios. In this paper, we point out the **spatial-temporal consistency** of VLA tasks: input images in consecutive steps exhibit high similarity. This motivates our key insight that token selection should combine local information with the model's global context. Based on this, we propose SpecPrune-VLA, a training-free, two-level pruning method with heuristic control. **(1) Action-Level Static Pruning:** We leverage global history and local attention to statically reduce visual tokens per action. **(2) Layer-Level Dynamic Pruning:** We prune tokens adaptively at each layer based on layer-wise importance. **(3) Lightweight Action-Aware Controller:** We classify actions as coarse- or fine-grained based on end-effector speed; since fine-grained actions are pruning-sensitive, the controller adjusts pruning aggressiveness accordingly. Extensive experiments show that, compared to the high-performing VLA model OpenVLA-OFT, SpecPrune-VLA achieves up to **1.57$\times$** speedup on the LIBERO simulation benchmark across different hardware configurations, and an average speedup of **1.70$\times$** in real-world robotic tasks, with negligible degradation in task success rate.
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 7336