LMGenDrive: LLM Reasoning Meets World Models for End-to-End Driving

14 Sept 2025 (modified: 13 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: end-to-end autonomous driving, large language models, world models, video generation, closed-loop
Abstract: Recent years have witnessed remarkable progress in autonomous driving, yet generalization to long-tail and open-world scenarios remains the primary bottleneck for large-scale deployment. To address this, one line of research explores LLMs and VLMs for their vision-language understanding and reasoning capabilities, equipping AVs with the ability to interpret rare and safety-critical situations when generating driving actions. In parallel, another line investigates generative world models that capture the spatio-temporal evolution of driving scenes, enabling agents to imagine and evaluate possible futures before acting. Inspired by human intelligence, which seamlessly unites understanding and imagination as a hallmark of AGI, this work explores a unified model that brings these two capabilities together for autonomous driving. We present LMGenDrive, the first framework that unifies LLM-based multimodal reasoning with generative world models for end-to-end closed-loop autonomous driving. Given multi-view camera inputs and natural-language instructions, our model generates both realistic future driving videos and the corresponding control signals. Coupling an LLM with generative video capabilities yields complementary benefits: future video prediction enhances the LLM's spatio-temporal scene understanding, while the LLM provides reasoning and instruction-following capabilities. A progressive three-stage training strategy, ranging from vision pretraining to multi-step long-horizon driving, is proposed to further improve stability and performance. The resulting model operates in two complementary modes: low-latency online planning and autoregressive offline video generation. Experiments show that LMGenDrive significantly outperforms state-of-the-art methods on challenging closed-loop driving benchmarks, improving instruction following, spatio-temporal reasoning, and robustness to rare scenarios. Our work not only sets a new state of the art in autonomous driving but also demonstrates that unifying multimodal understanding and generation offers a foundational new paradigm toward embodied AGI.
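For concreteness, the interface implied by the abstract (multi-view camera tokens plus an instruction in, future-frame latents and control signals out, with a low-latency planning mode and an autoregressive generation mode) can be sketched as below. This is a minimal illustrative sketch, not the authors' implementation: the class name LMGenDriveSketch, the placeholder transformer backbone, the linear video/control heads, and all dimensions are hypothetical stand-ins assumed only from what the abstract states.

```python
import torch
import torch.nn as nn


class LMGenDriveSketch(nn.Module):
    """Hypothetical sketch of the interface described in the abstract:
    a multimodal backbone fuses camera and instruction tokens; a control
    head serves online planning, and an autoregressive head rolls out
    latent future frames for offline video generation."""

    def __init__(self, d_model=512, horizon=8, action_dim=3):
        super().__init__()
        # Placeholders: the paper's model uses an LLM backbone and a
        # generative video decoder, not these stand-ins.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.video_head = nn.Linear(d_model, d_model)      # stand-in frame decoder
        self.control_head = nn.Linear(d_model, action_dim)
        self.horizon = horizon

    def forward(self, view_tokens, text_tokens, mode="plan"):
        # Fuse multi-view camera tokens with instruction tokens.
        h = self.backbone(torch.cat([view_tokens, text_tokens], dim=1))
        summary = h[:, -1]  # pooled scene representation (last token)
        if mode == "plan":
            # Low-latency online planning: control signals only, no rollout.
            return self.control_head(summary)
        # Offline mode: autoregressively roll out latent future frames.
        frames, cur = [], summary.unsqueeze(1)
        for _ in range(self.horizon):
            cur = self.video_head(cur)  # predict next latent frame
            frames.append(cur)
        return torch.cat(frames, dim=1), self.control_head(summary)


# Usage with random stand-in tokens: 6 camera views x 16 tokens, 12 text tokens.
model = LMGenDriveSketch()
views = torch.randn(1, 6 * 16, 512)
text = torch.randn(1, 12, 512)
action = model(views, text, mode="plan")                # -> shape (1, 3)
latents, action = model(views, text, mode="generate")   # -> (1, 8, 512), (1, 3)
```

The two-mode split mirrors the abstract's claim that planning can skip the expensive video rollout at inference time while generation remains available offline.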
Primary Area: applications to robotics, autonomy, planning
Submission Number: 5110