BEVDrive-E2E: Imitation With Bird's Eye View Perception for Interpretable End-to-End Autonomous Driving
Abstract: Imitation learning (IL) for end-to-end autonomous driving (E2E-AD) has recently made great progress in the closed-loop evaluation of the CARLA simulator. However, causal confusion remains an open problem. To address this issue, we propose BEVDrive-E2E, which explores the interpretability of the end-to-end model by using visual abstractions in bird's eye view (BEV). We design a hybrid BEV fusion module (HBFM) that combines the feature aggregation capabilities of CNNs and transformers over both local and global regions. To fully exploit the benefits of the BEV representation, we perform BEV detection and segmentation to form a unified semantic BEV map and adopt a two-stage training schedule. We leverage a transformer decoder to predict the sequential path points in an autoregressive manner. Our proposed approach achieves state-of-the-art (SOTA) performance on both the CARLA Leaderboard 1.0 and Leaderboard 2.0.
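The autoregressive path-point prediction mentioned above can be sketched as a loop in which each predicted waypoint is fed back as context for the next. This is a minimal illustrative sketch only: `decoder_step` stands in for the paper's transformer decoder, and its name and signature are assumptions, not the authors' API.

```python
def predict_waypoints(bev_feature, decoder_step, num_points=4):
    """Predict `num_points` path points one at a time, conditioning each
    step on all previously predicted points (autoregressive decoding)."""
    waypoints = []
    for _ in range(num_points):
        nxt = decoder_step(bev_feature, waypoints)  # attend to BEV features + history
        waypoints.append(nxt)
    return waypoints


# Toy stand-in for the decoder (hypothetical): steps 1 m forward
# from the last predicted point, starting at the ego position.
def toy_step(bev_feature, history):
    x, y = history[-1] if history else (0.0, 0.0)
    return (x, y + 1.0)


print(predict_waypoints(None, toy_step))
# → [(0.0, 1.0), (0.0, 2.0), (0.0, 3.0), (0.0, 4.0)]
```

In the actual model, each step would cross-attend to the fused BEV features rather than apply a fixed motion rule; the loop structure is what makes the prediction autoregressive.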
External IDs: doi:10.1109/lra.2026.3662561