Keywords: autonomous driving, vision-language-action model, efficient inference
Abstract: While the recent Alpamayo1 model sets a new baseline for Vision-Language-Action (VLA) models in autonomous driving, its significant inference latency precludes deployment on edge devices. In this work, we systematically analyze the performance bottlenecks at each inference stage (encode, prefill, decode, and action) of Alpamayo1-10B, revealing that the model suffers from severe spatial redundancy. To address these bottlenecks, we propose FlashDriveVLA, an algorithm-system co-design framework that comprehensively targets the efficiency bottleneck at each stage. FlashDriveVLA reduces end-to-end latency from 769.2 ms to 158.2 ms (a 4.9x speedup), bringing autonomous-driving VLAs substantially closer to real-time inference on edge hardware.
Submission Number: 83