Abstract: Scenario Description Languages (SDLs) serve as high-level encodings, offering an interpretable representation of the traffic situations encountered by autonomous vehicles (AVs). Their utility extends to critical safety analyses, such as identifying analogous traffic scenarios within vast AV datasets, and to real-to-simulation transfer. This paper addresses the challenging task of autonomously deriving SDL embeddings from AV data. We introduce the Scenario2Vector method, which leverages video transformers to automatically detect the spatio-temporal actions of the ego AV from front-camera video footage. Our methodology draws upon the Berkeley DeepDrive eXplanation (BDD-X) dataset. To determine ground-truth actions of the ego AV, we employ BERT combined with dependency grammar-based parse trees, using the resulting labels for Scenario2Vector training. Our approach is benchmarked against a 3D convolution (C3D)-based method and a transfer-learned video transformer (ViViT) model, evaluating both action-extraction accuracy and scenario-retrieval capability. The results reveal that Scenario2Vector is highly effective at detecting ego vehicle actions from video input, adeptly handling traffic scenarios with multiple ego vehicle maneuvers.