Planning Beyond Perception: Benchmarking LLM- and VLM-Based Reasoning for Autonomous Driving

Published: 18 Nov 2025, Last Modified: 20 Jan 2026 · PLAN-FM Bridge @ AAAI 2026 · CC BY 4.0
Keywords: autonomous driving, multimodal foundation models, long-horizon planning, vision-language reasoning, safety-aware decision making
Abstract: Recent advances in large multimodal foundation models (LLMs, VLMs, and VLAMs) have demonstrated promising capabilities in perception and reasoning across visual and linguistic modalities. Yet their effectiveness in long-horizon, safety-critical planning, a core requirement for autonomous driving, remains insufficiently understood. This work presents Planning Beyond Perception (PBP), a benchmark for systematically evaluating the planning and decision-making abilities of multimodal foundation models in realistic driving contexts. PBP encompasses tasks requiring situational reasoning under multimodal inputs, plan decomposition and adaptation across dynamic traffic scenarios, and safety-aware control constrained by real-world driving rules. Using standardized environments derived from CARLA and nuScenes, we assess multiple architectures, including LLM-, VLM-, and VLAM-based agents, on their ability to generate interpretable, robust, and executable driving plans. Our findings reveal that while these models excel at short-horizon perception and description, they exhibit significant limitations in causal reasoning, temporal abstraction, and reliable action synthesis. PBP provides an open, reproducible framework to benchmark and advance the development of foundation models for trustworthy autonomous planning.
Submission Number: 9