Keywords: Autonomous driving, diffusion model, end-to-end, transformer
TL;DR: We propose an end-to-end autonomous driving framework based on a unified diffusion-regression-classification policy, achieving state-of-the-art performance on both the CARLA and NAVSIM benchmarks.
Abstract: End-to-end learning has emerged as a transformative paradigm for autonomous driving, yet the inherently multimodal nature of driving behavior remains a fundamental challenge for robust deployment. We propose DiffE2E, a diffusion-based end-to-end autonomous driving framework. The architecture first aligns perception features from multiple sensors across scales via a hierarchical bidirectional cross-attention mechanism. It then employs a Transformer-based hybrid diffusion-regression-classification decoder, trained under a collaborative paradigm that combines the strengths of diffusion and explicit strategies. DiffE2E performs structured modeling in the latent space: diffusion captures the multimodal distribution of future trajectories, while regression and classification act as explicit strategies that precisely model key control variables such as velocity, enhancing both precision and controllability. A global condition integration module deeply fuses perception features with high-level goals, substantially improving the quality of trajectory generation, and a subsequent cross-attention mechanism enables efficient interaction between the integrated features and the hybrid latent variables, supporting joint optimization of the diffusion and explicit strategies for structured output generation and more robust control. Experiments show that DiffE2E achieves state-of-the-art performance on both CARLA closed-loop benchmarks and NAVSIM evaluations. The proposed unified framework integrating diffusion and explicit strategies provides a generalizable paradigm for hybrid action representation and shows substantial potential for extension to broader domains, including embodied intelligence.
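To make the hybrid decoder idea concrete, below is a minimal sketch of a Transformer decoder that combines a diffusion branch (denoising latent trajectory tokens) with explicit regression and classification heads for velocity, all conditioned on fused perception/goal tokens via cross-attention. This is an illustrative reconstruction, not the authors' implementation: the layer sizes, head names (`speed_reg`, `speed_cls`), timestep embedding, and conditioning interface are assumptions made for the example.

```python
# Hypothetical sketch of a hybrid diffusion-regression-classification decoder
# in PyTorch. All dimensions, module names, and the conditioning interface are
# illustrative assumptions, not the DiffE2E implementation.
import torch
import torch.nn as nn


class HybridPolicyDecoder(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=4,
                 horizon=8, n_speed_bins=16):
        super().__init__()
        # Shared Transformer trunk; cross-attention to the condition tokens
        # happens inside each decoder layer via the "memory" argument.
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.trunk = nn.TransformerDecoder(layer, n_layers)
        self.time_embed = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(),
                                        nn.Linear(d_model, d_model))
        self.traj_in = nn.Linear(2, d_model)        # noisy (x, y) waypoints in
        self.traj_out = nn.Linear(d_model, 2)       # diffusion branch: noise prediction
        self.speed_reg = nn.Linear(d_model, 1)      # explicit regression head (velocity)
        self.speed_cls = nn.Linear(d_model, n_speed_bins)  # explicit classification head
        self.horizon = horizon

    def forward(self, noisy_traj, t, cond):
        # noisy_traj: (B, horizon, 2) noised future waypoints
        # t:          (B,) diffusion timestep
        # cond:       (B, N, d_model) fused perception + goal tokens
        x = self.traj_in(noisy_traj) + self.time_embed(t.float()[:, None, None] / 1000.0)
        h = self.trunk(x, cond)                     # cross-attend latents to condition
        eps_pred = self.traj_out(h)                 # diffusion output
        pooled = h.mean(dim=1)
        return eps_pred, self.speed_reg(pooled), self.speed_cls(pooled)


if __name__ == "__main__":
    decoder = HybridPolicyDecoder()
    noisy = torch.randn(4, 8, 2)
    t = torch.randint(0, 1000, (4,))
    cond = torch.randn(4, 32, 256)                  # placeholder fused features
    eps, v, v_logits = decoder(noisy, t, cond)
    print(eps.shape, v.shape, v_logits.shape)       # (4, 8, 2) (4, 1) (4, 16)
```

In this reading, the diffusion output would be supervised with a standard denoising loss while the regression and classification heads receive direct losses on velocity, giving the collaborative training of diffusion and explicit strategies described in the abstract.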
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 2799