Abstract: With the adoption of autonomous driving systems and scenario-based testing, there is a growing need for efficient methods to understand and retrieve driving scenarios from vast amounts of real-world driving data. As manual scenario selection is labor-intensive and does not scale, this study explores the use of three Large Vision-Language Models, CLIP, BLIP-2, and BakLLaVA, for scenario retrieval. The models' ability to retrieve relevant scenarios from natural-language queries is evaluated on a diverse benchmark dataset of real-world driving scenarios using a precision metric. Factors such as scene complexity, weather conditions, and traffic situations are incorporated into the method through the 6-Layer Model to measure the effectiveness of the models across different driving contexts. This study contributes to the understanding of the capabilities and limitations of Large Vision-Language Models in the context of driving scenario retrieval and provides important insights into their practical applicability.
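To make the evaluation setup concrete, the following is a minimal sketch of how text-to-image scenario retrieval with CLIP and a precision@k metric could look. It is not the paper's implementation; the model checkpoint, the `retrieve` and `precision_at_k` helpers, the example query, and the relevance labels are all illustrative assumptions.

```python
# Minimal sketch: rank driving-scenario frames by CLIP text-image similarity
# and score the top-k results with precision@k. Requires torch, transformers,
# and Pillow; model name and helper functions are assumptions, not the
# authors' code.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve(query: str, image_paths: list[str], k: int = 5) -> list[str]:
    """Return the k scenario frames most similar to the natural-language query."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[query], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_text[0]: similarity of the single query to each image
        sims = model(**inputs).logits_per_text[0]
    top = sims.argsort(descending=True)[:k]
    return [image_paths[i] for i in top]

def precision_at_k(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the retrieved scenarios that are labeled relevant."""
    return sum(p in relevant for p in retrieved) / len(retrieved)

# Hypothetical usage: frame paths and ground-truth relevance labels would
# come from the benchmark dataset of real-world driving scenarios.
# top5 = retrieve("a pedestrian crossing in heavy rain", frame_paths, k=5)
# print(precision_at_k(top5, relevant_frames))
```

Under this setup, varying the query along the 6-Layer Model's dimensions (e.g., weather, traffic participants, road layout) would yield per-context precision scores of the kind the study reports.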