Abstract: With the adoption of autonomous driving systems and scenario-based testing, there is a growing need for efficient methods to understand and retrieve driving scenarios from vast amounts of real-world driving data. As manual scenario selection is labor-intensive and does not scale, this study explores the use of three Large Vision-Language Models, CLIP, BLIP-2, and BakLLaVA, for scenario retrieval. The models' ability to retrieve relevant scenarios from natural-language queries is evaluated on a diverse benchmark dataset of real-world driving scenarios using a precision metric. Factors such as scene complexity, weather conditions, and traffic situations are incorporated into the method through the 6-Layer Model to measure the effectiveness of the models across different driving contexts. This study contributes to the understanding of the capabilities and limitations of Large Vision-Language Models in the context of driving scenario retrieval and provides important insights into their practical applicability.
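To make the evaluation setup concrete, the following is a minimal sketch of how text-to-image scenario retrieval with CLIP and a precision@k metric could look. It is not the paper's implementation; the model checkpoint, the `retrieve` and `precision_at_k` helpers, the example query, and the relevance labels are all illustrative assumptions.

```python
# Minimal sketch: rank driving-scenario frames by CLIP text-image similarity
# and score the top-k results with precision@k. Requires torch, transformers,
# and Pillow; model name and helper functions are assumptions, not the
# authors' code.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve(query: str, image_paths: list[str], k: int = 5) -> list[str]:
    """Return the k scenario frames most similar to the natural-language query."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[query], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_text[0]: similarity of the single query to each image
        sims = model(**inputs).logits_per_text[0]
    top = sims.argsort(descending=True)[:k]
    return [image_paths[i] for i in top]

def precision_at_k(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the retrieved scenarios that are labeled relevant."""
    return sum(p in relevant for p in retrieved) / len(retrieved)

# Hypothetical usage: frame paths and ground-truth relevance labels would
# come from the benchmark dataset of real-world driving scenarios.
# top5 = retrieve("a pedestrian crossing in heavy rain", frame_paths, k=5)
# print(precision_at_k(top5, relevant_frames))
```

Under this setup, varying the query along the 6-Layer Model's dimensions (e.g., weather, traffic participants, road layout) would yield per-context precision scores of the kind the study reports.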