Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives

Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, Liang Pan

Published: 13 Oct 2025, Last Modified: 07 Mar 2026, ICCV 2025, CC BY 4.0

Abstract: Recent advancements in Vision-Language Models (VLMs) have sparked interest in autonomous driving applications, particularly those that incorporate interpretable human knowledge. However, the assumption that VLMs provide visually grounded and reliable driving explanations remains unexamined. To address this, we introduce DriveBench, a benchmark evaluating 12 VLMs across 17 settings, covering 19,200 images, 20,498 QA pairs, and four key driving tasks. Our findings reveal that existing VLMs often generate plausible responses from general knowledge or textual cues rather than true visual grounding, especially under degraded or missing visual inputs. This behavior, concealed by dataset imbalances and insufficient evaluation metrics, poses significant risks in safety-critical autonomous driving applications. We further observe that VLMs possess inherent corruption-awareness but explicitly acknowledge these issues only when directly prompted. Our study challenges existing evaluation paradigms and provides a road map toward more trustworthy autonomous driving systems.