SAR: Scene-Action Representation for End-to-End Autonomous Driving

16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: autonomous driving, scene representation
Abstract: End-to-end autonomous driving systems have made remarkable progress by integrating perception, prediction, and planning into a fully differentiable framework. However, most existing methods either rely heavily on dense intermediate supervision (e.g., segmentation and mapping) or neglect behavior modeling, which leads to significant trajectory deviations and safety risks in highly interactive scenarios. To address these challenges, we propose SAR, a novel end-to-end scene-action representation framework that enhances sparse scene modeling through structured behavior injection. Inspired by human driving cognition, SAR decomposes the scene into three complementary components: sparse scene semantics, ego-action awareness, and multi-agent action awareness. These components are fused via a specially designed Scene-Action Transformer to produce a consistent, interpretable, and interaction-aware representation for high-quality trajectory planning. Unlike prior approaches, SAR achieves strong generalization in highly interactive urban scenarios with only a small annotation cost. Experimental results on the nuScenes benchmark show that SAR reduces L2 trajectory error by 47% and collision rate by 41% compared to VAD. It also demonstrates superior robustness on NAVSIM and Bench2Drive, achieving new state-of-the-art performance in both open-loop and closed-loop evaluations. The code will be released soon.
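The abstract describes fusing three token groups (sparse scene semantics, ego-action awareness, multi-agent action awareness) with a Scene-Action Transformer before trajectory planning. Below is a minimal, hypothetical PyTorch sketch of such a fusion step; all module names, token counts, dimensions, and the planning head are illustrative assumptions, since the abstract gives no implementation details.

```python
# Hypothetical sketch (not the authors' code): fuse three token groups with a
# small transformer encoder and decode an ego trajectory from the ego token.
import torch
import torch.nn as nn


class SceneActionTransformerSketch(nn.Module):
    def __init__(self, dim=256, num_layers=3, num_heads=8, horizon=6):
        super().__init__()
        # Learnable type embeddings distinguish the three token groups.
        self.type_embed = nn.Embedding(3, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Assumed planning head: regress a horizon of (x, y) ego waypoints.
        self.plan_head = nn.Linear(dim, horizon * 2)
        self.horizon = horizon

    def forward(self, scene_tokens, ego_token, agent_tokens):
        # scene_tokens: (B, Ns, D) sparse scene-semantic queries
        # ego_token:    (B, 1,  D) ego-action awareness token
        # agent_tokens: (B, Na, D) multi-agent action-awareness tokens
        groups = [scene_tokens, ego_token, agent_tokens]
        types = torch.cat([
            torch.full((g.shape[1],), i, dtype=torch.long, device=g.device)
            for i, g in enumerate(groups)
        ])
        tokens = torch.cat(groups, dim=1) + self.type_embed(types)
        fused = self.encoder(tokens)
        # The ego token sits immediately after the scene tokens in the sequence.
        ego_fused = fused[:, scene_tokens.shape[1], :]
        return self.plan_head(ego_fused).view(-1, self.horizon, 2)


if __name__ == "__main__":
    B, D = 2, 256
    model = SceneActionTransformerSketch(dim=D)
    traj = model(torch.randn(B, 32, D), torch.randn(B, 1, D), torch.randn(B, 8, D))
    print(traj.shape)  # torch.Size([2, 6, 2])
```

The joint self-attention over all three groups is one plausible reading of "structured behavior injection"; the paper may instead use cross-attention or other fusion mechanisms.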
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 6832