CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding

14 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Vision-Language-Action Model, Model Acceleration
TL;DR: An efficient VLA inference-acceleration method that combines consistency distillation, mixed-label supervision, and early-exit decoding to predict multiple action tokens simultaneously, achieving SOTA-level speedup (>4×) with comparable performance.
Abstract: The practical deployment of Vision-Language-Action (VLA) models is severely constrained by inference-speed bottlenecks, particularly in high-frequency and dexterous manipulation tasks. Although recent studies have explored Jacobi decoding as a more efficient alternative to traditional autoregressive decoding, its practical benefit is marginal because of the many iterations it requires. To address this problem, we introduce consistency distillation training, which enables the model to predict multiple correct action tokens in each iteration and thereby accelerates decoding. In addition, we design mixed-label supervision to mitigate error accumulation during distillation. Although these two techniques yield a clear speedup, we identify certain inefficient iterations as a remaining critical limitation. To tackle this, we propose an early-exit decoding strategy that moderately relaxes the convergence condition, further improving average inference efficiency. Experimental results show that the proposed method achieves more than 4× inference acceleration across different base models while maintaining high task success rates in both simulated and real-world robot tasks. These experiments validate that our approach provides an efficient and general paradigm for accelerating multimodal decision-making in robotics.
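For context, the sketch below illustrates the general idea of Jacobi decoding with a relaxed early-exit criterion; it is not the authors' implementation. It assumes a HuggingFace-style causal LM (a `model(...)` call returning `.logits` and exposing `model.config.vocab_size`), greedy refinement, a random initial draft, and a hypothetical `min_stable` parameter as one possible way to relax the fixed-point convergence condition.

```python
import torch

def jacobi_decode_early_exit(model, prompt_ids, n_action_tokens,
                             max_iters=16, min_stable=None):
    """Jacobi-style parallel decoding with an early-exit criterion.

    Instead of generating action tokens one at a time, all n_action_tokens
    positions are drafted at once and refined in parallel until they reach
    a fixed point. `min_stable` relaxes the exit condition: we stop as soon
    as that many leading tokens are stable, rather than all of them.
    (Illustrative sketch; names and init strategy are assumptions.)
    """
    device = prompt_ids.device
    # Random initial draft for the action-token block (illustrative choice).
    draft = torch.randint(0, model.config.vocab_size,
                          (1, n_action_tokens), device=device)
    min_stable = n_action_tokens if min_stable is None else min_stable

    for _ in range(max_iters):
        inputs = torch.cat([prompt_ids, draft], dim=1)
        logits = model(inputs).logits
        # Greedy refinement: every action position is re-predicted in one
        # forward pass, conditioned on the (possibly wrong) current draft.
        new_draft = logits[:, prompt_ids.shape[1] - 1:-1, :].argmax(dim=-1)
        stable = (new_draft == draft).squeeze(0)
        draft = new_draft
        # Early exit: only the first `min_stable` tokens must have stopped
        # changing, instead of requiring the full block to converge.
        if stable[:min_stable].all():
            break
    return draft
```

In strict Jacobi decoding the loop runs until every token is unchanged between iterations; the paper's contribution is that a moderately relaxed condition, combined with consistency distillation so that more tokens become correct per iteration, cuts the average number of iterations while preserving task success.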
Primary Area: applications to robotics, autonomy, planning
Submission Number: 5228