Keywords: Autonomous Driving, Simulation Evaluation Metrics, Closed-loop Evaluation, Tail-Risk Assessment
Abstract: Long-horizon closed-loop simulation is increasingly used to evaluate and improve autonomous driving systems, yet its safety realism is still largely judged by benchmark-facing binary collision metrics. We argue that this creates a safety illusion: simulators can achieve near-saturated benchmark scores while still exhibiting substantially different physical risk. We trace this limitation to a fundamental Resolution Gap in current evaluation, where collision outcomes are collapsed into rollout-level binary indicators that capture event frequency but not failure severity. To address this problem, we propose Composite Severity Analysis, a physically grounded framework that decomposes pairwise collisions into relative impact velocity, penetration depth, and contact duration, and aggregates them into a continuous severity representation with taxonomy-aware noise filtering. Building on this formulation, we introduce the Composite Collision Metric (CCM), which jointly captures collision occurrence and consequence severity. Experiments on the Waymo Open Sim Agents Challenge (WOSAC) show that simulators with nearly identical benchmark-facing collision performance can exhibit sharply different tail-risk profiles, while safety-oriented improvements that are nearly invisible to standard metrics become clearly measurable under CCM. These results suggest that binary collision likelihood is insufficient for simulator safety evaluation and that severity-aware analysis is necessary for trustworthy safety alignment in autonomous driving simulation.
Submission Number: 7
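Illustrative sketch (not the paper's CCM): the abstract's point is that a rollout-level binary collision indicator can hide large differences in severity. The Python below is a minimal toy under stated assumptions. The `Collision` fields mirror the three components named in the abstract, but the reference scales (`v_ref`, `d_ref`, `t_ref`), the equal-weight aggregation in `severity`, and the `mean_max_severity` summary are hypothetical choices made for illustration only. The taxonomy-aware noise filtering is not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Collision:
    rel_speed: float  # relative impact velocity, m/s
    depth: float      # penetration depth, m
    duration: float   # contact duration, s


def severity(c: Collision, v_ref: float = 10.0, d_ref: float = 1.0,
             t_ref: float = 1.0) -> float:
    """Hypothetical severity in [0, 1]: equal-weight mean of the three
    components, each divided by an assumed reference scale and clipped at 1."""
    parts = [
        min(c.rel_speed / v_ref, 1.0),
        min(c.depth / d_ref, 1.0),
        min(c.duration / t_ref, 1.0),
    ]
    return sum(parts) / len(parts)


def binary_collision_rate(rollouts: List[List[Collision]]) -> float:
    """Fraction of rollouts containing at least one collision."""
    return sum(1 for r in rollouts if r) / len(rollouts)


def mean_max_severity(rollouts: List[List[Collision]]) -> float:
    """Mean over rollouts of the worst collision severity (0 if none)."""
    worst = [max((severity(c) for c in r), default=0.0) for r in rollouts]
    return sum(worst) / len(worst)


# Two toy simulators: one colliding rollout out of four in each case.
sim_a = [[Collision(0.5, 0.05, 0.1)], [], [], []]   # low-speed graze
sim_b = [[Collision(10.0, 1.0, 1.0)], [], [], []]   # high-speed, deep, sustained

print(binary_collision_rate(sim_a), binary_collision_rate(sim_b))
print(mean_max_severity(sim_a), mean_max_severity(sim_b))
```

In this toy setup both simulators have the same binary collision rate, while the severity-aware summary differs by more than an order of magnitude, which is the kind of gap a binary indicator cannot resolve.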