Semantic-Drive: Trustworthy and Efficient Long-Tail Data Curation via Open-Vocabulary Grounding and Neuro-Symbolic VLM Consensus
Abstract: The development of Autonomous Vehicles (AVs) is currently hampered by a scarcity of long-tail training data. While fleets collect petabytes of video logs, identifying rare safety-critical events, such as erratic jaywalking or complex construction diversions, remains a manual and often cost-prohibitive process. Existing automated solutions rely either on coarse metadata search, which lacks semantic precision, or on cloud-based Vision-Language Models (VLMs), which introduce privacy concerns and computational overhead. In this work, we introduce Semantic-Drive, a local-first, neuro-symbolic framework designed for verifiable semantic data mining. Our approach decouples perception into two distinct stages: (1) Symbolic Grounding, where a real-time open-vocabulary detector (YOLOE) anchors attention, and (2) Cognitive Analysis, where a Reasoning VLM performs forensic scene analysis. To reduce the hallucinations and reliability issues common in generative models, we implement a "System 2" inference-time alignment strategy that uses a multi-model "Judge-Scout" consensus mechanism. Benchmarked on the nuScenes dataset against the Waymo Open Dataset (WOD-E2E) taxonomy, Semantic-Drive achieves a recall of 0.966 on safety-critical scenarios (vs. 0.331 for OWL-v2 and 0.271 for Grounding DINO). Notably, the system reduces risk assessment error by 40% compared to single-model baselines. The entire pipeline runs on consumer hardware (NVIDIA RTX 3090), offering an accessible and privacy-preserving alternative to cloud-native architectures.
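As a rough illustration of the "Judge-Scout" consensus idea described in the abstract, the following Python sketch shows one way several scout outputs could be reconciled against detector-grounded labels. It is not taken from the Semantic-Drive repository: the names `ScoutReport` and `judge_consensus`, the majority-vote threshold `min_votes`, the median aggregation of risk scores, and the example labels are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class ScoutReport:
    """Hypothetical output of one scout VLM for a single scene."""
    model: str
    risk_score: float
    tags: frozenset


def judge_consensus(reports, grounded_labels, min_votes=2):
    """Merge scout reports into one verdict (illustrative assumption).

    A tag is accepted only if at least `min_votes` scouts proposed it AND
    the open-vocabulary detector also produced it (symbolic grounding).
    The risk score is the median across scouts, which damps one outlier.
    """
    if not reports:
        raise ValueError("need at least one scout report")

    # Count how many scouts proposed each tag.
    votes = {}
    for report in reports:
        for tag in report.tags:
            votes[tag] = votes.get(tag, 0) + 1

    accepted = sorted(
        tag for tag, count in votes.items()
        if count >= min_votes and tag in grounded_labels
    )
    rejected = sorted(set(votes) - set(accepted))

    return {
        "risk_score": median(r.risk_score for r in reports),
        "tags": accepted,
        "rejected": rejected,
    }


# Made-up example values, not results from the paper.
reports = [
    ScoutReport("scout_a", 7.0, frozenset({"pedestrian", "construction_cone"})),
    ScoutReport("scout_b", 8.0, frozenset({"pedestrian", "fire"})),
    ScoutReport("scout_c", 3.0, frozenset({"pedestrian", "construction_cone"})),
]
grounded = {"pedestrian", "construction_cone", "car"}
result = judge_consensus(reports, grounded)
```

In this toy run the tag proposed by a single scout and absent from the detector output is rejected, which is the kind of hallucination filtering the two-stage design is meant to provide. The actual judging logic in the paper may differ substantially.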
Submission Type: Long submission (more than 12 pages of main content)
Code: https://github.com/AntonioAlgaida/Semantic-Drive
Assigned Action Editor: ~Zhangyang_Wang1
Submission Number: 7073