Zero-Shot Quantization for Vision-Language-Action Models via Trajectory Curvature and Attention Guidance
Keywords: Vision-Language-Action Model, Zero-Shot Quantization, Flow Matching
TL;DR: We propose a Zero-Shot Quantization (ZSQ) framework for VLA models that generates synthetic calibration data from Flow Matching trajectory curvature and attention-guided masking, with no access to the original data.
Abstract: Vision-Language-Action (VLA) models have recently advanced Embodied AI by integrating the reasoning abilities of large language models (LLMs) into robotic control.
While state-of-the-art VLAs combine Large Vision-Language Models (LVLMs) with Diffusion Transformers (DiTs), their substantial memory and computational overhead limit deployment on edge devices.
Moreover, existing optimization techniques often require training data, which is frequently inaccessible due to privacy concerns.
To the best of our knowledge, we introduce the first Zero-Shot Quantization (ZSQ) framework for VLA models.
By exploiting Flow Matching characteristics, we employ trajectory curvature and an attention-guided masking strategy to generate synthetic calibration data without any access to the original datasets.
Our method reduces the memory footprint of the quantized components in $\pi_{0.5}$ and NVIDIA GR00T N1.5 by 70\% and 55\%, respectively, under the W4A8 setting, while retaining success rates comparable to data-dependent quantization methods.
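For readers unfamiliar with the notation, W4A8 denotes 4-bit weights and 8-bit activations. The sketch below is only an illustration of that setting, not the authors' quantizer: it assumes symmetric round-to-nearest quantization with one shared scale, and the helper names, the 16-bit baseline, the 16-bit scales, and the group size are all hypothetical choices that the abstract does not specify.

```python
def quantize_symmetric(values, num_bits):
    """Round values onto a signed integer grid that shares one scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Clamp so that every code fits in the signed num_bits range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


def weight_memory_reduction(num_weights, group_size, weight_bits=4,
                            baseline_bits=16, scale_bits=16):
    """Fraction of weight memory saved when each group stores one scale.

    The baseline precision and scale width are assumptions made for
    illustration only.
    """
    num_groups = -(-num_weights // group_size)  # ceiling division
    quant_bits = num_weights * weight_bits + num_groups * scale_bits
    return 1.0 - quant_bits / (num_weights * baseline_bits)


# W4: weights go to 4-bit codes. A8: activations go to 8-bit codes.
weights = [0.7, -0.31, 0.12, -0.04]
w_codes, w_scale = quantize_symmetric(weights, num_bits=4)
activations = [1.27, -0.64, 0.01]
a_codes, a_scale = quantize_symmetric(activations, num_bits=8)

print(w_codes, dequantize(w_codes, w_scale))
print(a_codes, dequantize(a_codes, a_scale))
print(weight_memory_reduction(num_weights=1024, group_size=32))
```

Under these assumptions, storing 4 bits instead of 16 per weight saves at most 75% before the per-group scales are counted. The reductions reported for each model depend on details the abstract does not give, so this arithmetic is a rough point of reference rather than an explanation of those figures.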
Submission Number: 188