Keywords: Autonomous Driving, World Model, Reinforcement Learning
TL;DR: We propose a dual-policy framework that uses a latent world model to combine imitation and reinforcement learning for autonomous driving, improving safety and robustness without external simulators.
Abstract: Recent advances in generative video models such as SORA have renewed interest in using world models to simulate physical dynamics for embodied decision-making tasks like autonomous driving. In parallel, end-to-end driving frameworks have begun to incorporate latent world models that predict future latent states as an auxiliary objective, trained jointly with imitation learning to strengthen planning. These models help encode environment dynamics and improve planning accuracy, but they treat the world model as a passive auxiliary module. Separately, the Dreamer series has demonstrated the potential of latent world models as simulators for reinforcement learning (RL), enabling agents to learn through imagined rollouts. However, combining imitation learning (IL) and RL within latent world models remains underexplored, and naive attempts to jointly optimize a shared policy often lead to instability and degraded performance. In this work, we propose a dual-policy framework that decouples the IL and RL agents while sharing a common latent world model. The IL policy learns from expert driving data via supervised latent rollouts, while the RL policy explores the same latent environment through Dreamer-style training. Rather than fusing the two objectives, the agents are trained independently and compete during learning. Based on the outcome of this competition, knowledge (either expert behavior or exploratory experience) is selectively shared between the agents. This architecture enables each policy to specialize while benefiting from the other's strengths. Experiments in complex driving scenarios demonstrate that our approach outperforms imitation-only baselines, yielding more robust and generalizable autonomous driving policies. We will release our code on GitHub soon.
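At a very high level, the framework described in the abstract can be pictured as a shared latent world model with two separate policy heads and a competition-based knowledge-transfer step. The Python sketch below is purely illustrative and not the submission's actual implementation: the module sizes, the WorldModel, make_policy, and train_step names, and the imagined-return-based distillation rule are all our assumptions about one plausible instantiation.

# Minimal, hypothetical sketch: a shared latent world model, an IL policy trained by
# behavior cloning on expert latents, an RL policy trained on imagined latent rollouts,
# and a simple competition step that distills from the better-performing policy into
# the other. All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_DIM, ACT_DIM, LATENT_DIM = 64, 2, 32

class WorldModel(nn.Module):
    """Encodes observations into latents and predicts latent transitions and rewards."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(OBS_DIM, LATENT_DIM)
        self.dynamics = nn.GRUCell(ACT_DIM, LATENT_DIM)   # latent transition model
        self.reward_head = nn.Linear(LATENT_DIM, 1)

    def encode(self, obs):
        return torch.tanh(self.encoder(obs))

    def imagine(self, z, policy, horizon=5):
        """Roll the latent dynamics forward under a policy; return summed predicted reward."""
        rewards = []
        for _ in range(horizon):
            a = policy(z)
            z = self.dynamics(a, z)
            rewards.append(self.reward_head(z))
        return torch.stack(rewards).sum(0)

def make_policy():
    return nn.Sequential(nn.Linear(LATENT_DIM, 64), nn.ReLU(),
                         nn.Linear(64, ACT_DIM), nn.Tanh())

world_model = WorldModel()
il_policy, rl_policy = make_policy(), make_policy()
il_opt = torch.optim.Adam(il_policy.parameters(), lr=1e-4)
rl_opt = torch.optim.Adam(rl_policy.parameters(), lr=1e-4)

def train_step(expert_obs, expert_act):
    # IL policy: supervised behavior cloning on latents encoded from expert data.
    z = world_model.encode(expert_obs).detach()
    il_loss = F.mse_loss(il_policy(z), expert_act)
    il_opt.zero_grad(); il_loss.backward(); il_opt.step()

    # RL policy: maximize predicted return over imagined rollouts (Dreamer-style).
    rl_loss = -world_model.imagine(z, rl_policy).mean()
    rl_opt.zero_grad(); rl_loss.backward(); rl_opt.step()

    # Competition: the policy with higher imagined return teaches the other
    # via a simple distillation loss on the losing policy's actions.
    with torch.no_grad():
        il_ret = world_model.imagine(z, il_policy).mean()
        rl_ret = world_model.imagine(z, rl_policy).mean()
    winner, loser, loser_opt = ((il_policy, rl_policy, rl_opt) if il_ret > rl_ret
                                else (rl_policy, il_policy, il_opt))
    distill_loss = F.mse_loss(loser(z), winner(z).detach())
    loser_opt.zero_grad(); distill_loss.backward(); loser_opt.step()

# Example call with random stand-in data for a batch of expert observations and actions.
train_step(torch.randn(16, OBS_DIM), torch.randn(16, ACT_DIM))

In this sketch the world model itself is kept fixed for clarity; in a Dreamer-style setup it would also be trained on logged driving data, and the knowledge-sharing rule could be any selective transfer mechanism rather than the plain distillation shown here.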
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 17316