Remedying the Curse of Autonomous Driving: VLM Driven Training-Free Framework for Efficient Long-Tail Video Detection

Burhaneddin Yaman; Xinyue Wang; Zhaoyin Jia; Min Cai; Thomas Lampo; Jin Sun; Vivasvat Keswani; Danhua Guo

Published: 08 Apr 2026, Last Modified: 08 Apr 2026. Venue: CVPR 2026 Workshop WDFM-EAI (Poster). Readers: Everyone. License: CC BY 4.0

Keywords: Autonomous Driving

TL;DR: We introduce a training-free two-stage VLM framework for efficiently detecting rare, safety-critical driving scenarios, achieving 24% higher AUC and 14× faster inference.

Abstract: Autonomous driving has made remarkable progress and is now deployed in the real world. These advances have been driven by training models on ever-larger volumes of data. Data collection is now streamlined, with vast amounts ($\sim$100 years of driving data) collectible in a single day. However, most of this data is routine and does not help models generalize; on the contrary, it may bias them toward routine driving scenarios. This creates a critical bottleneck in model generalization, as systems remain vulnerable to long-tail scenarios that are rare but safety-critical. Hence, training models on such scenarios is essential for safe and scalable deployment. Despite extensive research on end-to-end driving models, a systematic method for efficiently detecting these long-tail events in large-scale datasets is missing. In this work, we propose a novel *training-free* two-stage framework based on vision language models. In the first stage, a small language model processes each video to produce a scenario summary. In the second stage, a large language model processes that summary to rank the video's long-tail relevance. The framework is designed to efficiently process industry-scale video datasets and accurately classify the relevance of video segments to long-tail events. Experimental results show that our method surpasses counterpart methods by 24\% in AUC and provides $\sim14\times$ faster inference than counterpart multimodal large language models, enabling scalable and targeted data mining for autonomous systems, which is critical for generalizing driving models for deployment. We hope our study paves the way for further research in this critical field.
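
The two-stage flow described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the abstract does not specify the models, prompts, or scoring scale, so `Clip`, `rank_long_tail`, `toy_summarize`, `toy_score`, and `RARE_TERMS` are all hypothetical names, and the two stub functions merely stand in for the small summarization model and the large ranking model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Clip:
    clip_id: str
    frames: List[str]  # placeholder for decoded frames (here: plain text tags)


def rank_long_tail(
    clips: List[Clip],
    summarize: Callable[[Clip], str],  # stage 1: small model, video -> text summary
    score: Callable[[str], float],     # stage 2: large model, summary -> relevance
) -> List[Tuple[str, float, str]]:
    """Return (clip_id, relevance, summary) sorted by descending relevance."""
    results = []
    for clip in clips:
        summary = summarize(clip)      # the (assumed cheaper) model sees the video
        relevance = score(summary)     # the larger model sees only the text summary
        results.append((clip.clip_id, relevance, summary))
    return sorted(results, key=lambda r: r[1], reverse=True)


# Hypothetical stand-ins for the two models, so the sketch runs without any
# model dependency. A real system would call actual models here.
def toy_summarize(clip: Clip) -> str:
    return "Scene contains: " + ", ".join(clip.frames)


RARE_TERMS = ("animal", "debris", "wrong-way")  # assumed examples of rare events


def toy_score(summary: str) -> float:
    hits = sum(term in summary for term in RARE_TERMS)
    return hits / len(RARE_TERMS)


clips = [
    Clip("a", ["highway", "cars"]),
    Clip("b", ["debris", "animal on road"]),
]
ranked = rank_long_tail(clips, toy_summarize, toy_score)
for clip_id, relevance, summary in ranked:
    print(clip_id, round(relevance, 2), summary)
```

In this sketch only the first stage touches video frames, while the second stage works on short text; the abstract does not break down where its reported speedup comes from, so this split is shown purely to illustrate the structure.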

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 3