The Challenge of Reliable Vision–Language Model Responses in Driving

ICLR 2026 Conference Submission 6368 Authors

15 Sept 2025 (modified: 08 Oct 2025) | ICLR 2026 Conference Submission | Readers: Everyone | License: CC BY 4.0
Keywords: Autonomous Driving, Temporal Reasoning, Reliable Driving Assistant
Abstract: Reliable decision-making relies on both prediction and reasoning. In this work, we investigate whether Vision-Language Models (VLMs), when applied as driving assistants, can genuinely understand how present observations shape future outcomes, or whether their outputs merely reflect patterns memorized during training without grounded temporal reasoning. While recent efforts have integrated VLMs into autonomous driving, prior studies typically emphasize scene understanding and instruction generation, implicitly assuming that strong visual interpretation naturally enables future reasoning and thus ensures reliable decision-making, a claim we critically examine. We identify two major challenges limiting VLM reliability in this setting: (1) response inconsistency, where minor input perturbations yield different answers or, in some cases, responses degenerate toward near-random guessing; and (2) limited temporal reasoning, in which models fail to reason about and align sequential events from current observations, often resulting in incorrect or even contradictory responses. Moreover, we find that models with strong visual understanding do not necessarily perform best on tasks requiring temporal reasoning, indicating a tendency to over-rely on pretrained patterns rather than modeling temporal dynamics. To address these issues, we adopt existing evaluation methods and introduce FutureVQA, a human-annotated benchmark dataset specifically designed to assess future scene reasoning. In addition, we propose a simple yet effective self-supervised tuning approach that improves both consistency and temporal reasoning without requiring temporal labels.
Supplementary Material: pdf
Primary Area: applications to robotics, autonomy, planning
Submission Number: 6368