HyperVLA: Efficient Inference in Vision-Language-Action Models via Hypernetworks

ICLR 2026 Conference Submission 18399 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: multi-task learning, imitation learning, robotic control, VLA, hypernetworks
Abstract: Built upon language and vision foundation models with strong generalization ability and trained on large-scale robotic data, Vision-Language-Action (VLA) models have recently emerged as a promising approach to learning generalist robotic policies. However, a key drawback of existing VLAs is their extremely high inference cost. In this paper, we propose HyperVLA to address this problem. Unlike existing monolithic VLAs that activate the whole model during both training and inference, HyperVLA uses a novel hypernetwork (HN)-based architecture that activates only a small task-specific policy during inference, while still retaining the high model capacity needed to accommodate diverse multi-task behaviors during training. Successfully training an HN-based VLA is nontrivial, so HyperVLA incorporates several key algorithmic design features that improve its performance, including proper utilization of prior knowledge from existing vision foundation models, HN normalization, and an action generation strategy. We train HyperVLA on the Open X-Embodiment dataset and evaluate it on the SIMPLER benchmark. Compared to existing VLAs, HyperVLA achieves a similar or even higher success rate while significantly reducing inference costs. Notably, compared to OpenVLA, a state-of-the-art VLA model, HyperVLA reduces the number of activated parameters at test time by $90\times$ and accelerates inference by $120\times$.
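The core mechanism described in the abstract, a hypernetwork that generates a small task-specific policy so that only this small policy is activated at each inference step, can be illustrated with a minimal sketch. The following PyTorch example is purely illustrative: the layer sizes, the source of the task embedding, and all class and function names (HyperPolicy, generate, act) are assumptions, not HyperVLA's actual architecture.

```python
# Minimal sketch of the hypernetwork idea, assuming a PyTorch setup.
# All dimensions and names are hypothetical, not HyperVLA's design.
import torch
import torch.nn as nn


class HyperPolicy(nn.Module):
    """A hypernetwork maps a task embedding (e.g. from a frozen vision-language
    encoder) to the weights of a small task-specific policy MLP. Only that small
    policy is evaluated per control step at inference time."""

    def __init__(self, task_dim: int, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.obs_dim, self.act_dim, self.hidden = obs_dim, act_dim, hidden
        # Total parameter count of the small two-layer policy MLP.
        n_params = (obs_dim * hidden + hidden) + (hidden * act_dim + act_dim)
        # Large-capacity generator: used during training / once per task.
        self.generator = nn.Sequential(
            nn.Linear(task_dim, 512), nn.GELU(), nn.Linear(512, n_params)
        )

    def generate(self, task_emb: torch.Tensor):
        """Produce the small policy's parameters from the task embedding."""
        flat = self.generator(task_emb)
        i = 0
        w1 = flat[i:i + self.obs_dim * self.hidden].view(self.hidden, self.obs_dim)
        i += self.obs_dim * self.hidden
        b1 = flat[i:i + self.hidden]
        i += self.hidden
        w2 = flat[i:i + self.hidden * self.act_dim].view(self.act_dim, self.hidden)
        i += self.hidden * self.act_dim
        b2 = flat[i:i + self.act_dim]
        return w1, b1, w2, b2

    @staticmethod
    def act(params, obs: torch.Tensor) -> torch.Tensor:
        """Cheap per-step inference: only the generated small policy runs."""
        w1, b1, w2, b2 = params
        h = torch.tanh(obs @ w1.T + b1)
        return h @ w2.T + b2


# Usage: generate the small policy once per task, then reuse it every control step.
hn = HyperPolicy(task_dim=768, obs_dim=32, act_dim=7)
params = hn.generate(torch.randn(768))
action = HyperPolicy.act(params, torch.randn(32))
```

The design point this sketch highlights is the one the abstract makes: the expensive generator is amortized (run once per task), while per-step inference touches only the tiny generated policy, which is where the reported reductions in activated parameters and latency would come from.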
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 18399