Keywords: autonomous driving, multimodal foundation models, long-horizon planning, vision-language reasoning, safety-aware decision making
Abstract: Recent advances in large multimodal foundation models (LLMs, VLMs, and VLAMs) have demonstrated promising capabilities in perception and reasoning across visual and linguistic modalities. Yet their effectiveness in long-horizon, safety-critical planning, a core requirement for autonomous driving, remains insufficiently understood. This work presents Planning Beyond Perception (PBP), a benchmark for systematically evaluating the planning and decision-making abilities of multimodal foundation models in realistic driving contexts. PBP encompasses tasks requiring situational reasoning over multimodal inputs, plan decomposition and adaptation across dynamic traffic scenarios, and safety-aware control constrained by real-world driving rules. Using standardized environments derived from CARLA and nuScenes, we assess multiple architectures, including LLM-, VLM-, and VLAM-based agents, on their ability to generate interpretable, robust, and executable driving plans. Our findings reveal that while these models excel at short-horizon perception and description, they exhibit significant limitations in causal reasoning, temporal abstraction, and reliable action synthesis. PBP provides an open, reproducible framework for benchmarking and advancing the development of foundation models for trustworthy autonomous planning.
Submission Number: 9