AutoDrive-QA: A Multiple-Choice Benchmark for Vision–Language Evaluation in Urban Autonomous Driving

Published: 30 Sept 2025, Last Modified: 24 Nov 2025 · urbanai Poster · CC BY 4.0
Keywords: Vision–Language Models, Autonomous Driving, Urban Scene Understanding, Multiple-Choice Benchmark, Evaluation Framework
TL;DR: A standardized multiple-choice benchmark for evaluating vision–language models in autonomous driving, supporting more reliable intelligent transportation systems.
Abstract: Evaluating vision–language models (VLMs) in urban driving contexts remains challenging, as existing benchmarks rely on open-ended responses that are ambiguous, annotation-intensive, and inconsistent to score. This lack of standardized evaluation slows progress toward safe and reliable AI for urban mobility. We introduce AutoDrive-QA, the first benchmark that systematically converts open-ended driving QA datasets (DriveLM, NuScenes-QA, LingoQA) into structured multiple-choice questions (MCQs) with distractors grounded in five realistic error categories: Driving Domain Misconceptions, Logical Inconsistencies, Misinterpreted Sensor Inputs, Computational Oversights, and Question Ambiguity. This framework enables reproducible and interpretable evaluation of VLMs across perception, prediction, and planning tasks in complex urban scenes. Experiments show that fine-tuning LLaVA-1.5-7B improves accuracy by about six percentage points across tasks, GPT-4V achieves the strongest zero-shot performance with up to 69.8% accuracy, and Qwen2-VL models also perform competitively, particularly in multi-view settings. Moreover, traditional metrics such as BLEU and CIDEr fail to distinguish strong from weak models. By providing an objective, domain-grounded evaluation protocol, AutoDrive-QA contributes to more transparent benchmarking of urban AI systems, supporting the development of safer and more trustworthy autonomous driving technologies for smart cities.
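To make the conversion concrete, below is a minimal sketch (not the authors' released code) of how an open-ended QA pair might be restructured into a multiple-choice item whose distractors are tagged with the five error categories named in the abstract. The names `McqItem` and `to_mcq`, and the toy distractor texts, are hypothetical illustrations; in the benchmark itself the distractors are generated and curated per category.

```python
# Hypothetical sketch of the MCQ-conversion data structure used for illustration only.
import random
from dataclasses import dataclass, field

# The five distractor error categories described in the abstract.
ERROR_CATEGORIES = [
    "Driving Domain Misconceptions",
    "Logical Inconsistencies",
    "Misinterpreted Sensor Inputs",
    "Computational Oversights",
    "Question Ambiguity",
]

@dataclass
class McqItem:
    question: str
    options: list                 # shuffled answer texts: one correct answer plus distractors
    correct_index: int            # position of the ground-truth answer after shuffling
    distractor_categories: dict = field(default_factory=dict)  # option index -> error category

def to_mcq(question: str, correct_answer: str, distractors: dict, seed: int = 0) -> McqItem:
    """Build one MCQ from an open-ended QA pair.

    `distractors` maps an error category to a distractor answer string
    (e.g. produced by prompting a model with category-specific instructions).
    """
    rng = random.Random(seed)
    options = [correct_answer] + list(distractors.values())
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    correct_index = shuffled.index(correct_answer)
    # Track which error category each shuffled distractor came from.
    categories = {
        new_pos: list(distractors.keys())[old_pos - 1]
        for new_pos, old_pos in enumerate(order)
        if old_pos != 0  # index 0 was the correct answer
    }
    return McqItem(question, shuffled, correct_index, categories)

# Toy usage example with invented texts:
item = to_mcq(
    "What should the ego vehicle do at the occluded crosswalk ahead?",
    "Slow down and yield to any pedestrians that may emerge.",
    {
        "Driving Domain Misconceptions": "Accelerate to clear the crosswalk quickly.",
        "Misinterpreted Sensor Inputs": "Stop for the cyclist currently in the crosswalk.",
        "Logical Inconsistencies": "Yield because the traffic light is green.",
    },
)
print(item.options[item.correct_index])
```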
Submission Number: 65