Semantic-Drive: Trustworthy and Efficient Long-Tail Data Curation via Open-Vocabulary Grounding and Neuro-Symbolic VLM Consensus
Abstract: The development of robust Autonomous Vehicles (AVs) is currently hampered by a critical scarcity of "Long-Tail" training data. While fleets collect petabytes of video logs, identifying rare safety-critical events, such as erratic jaywalking or complex construction diversions, remains a manual and often cost-prohibitive process. Existing automated solutions rely either on coarse metadata search, which lacks semantic precision, or on cloud-based Vision-Language Models (VLMs), which introduce significant privacy risks and latency overheads. In this work, we introduce Semantic-Drive, a local-first, neuro-symbolic framework for trustworthy semantic data mining. Our approach decouples perception into two distinct stages: (1) Symbolic Grounding via a real-time open-vocabulary detector (YOLOE) to anchor attention, and (2) Cognitive Analysis, where a Reasoning VLM performs forensic scene analysis. To mitigate the hallucination and reliability issues common in generative models, we implement a "System 2" inference-time alignment strategy that uses a multi-model "Judge-Scout" consensus mechanism. Benchmarked on the nuScenes dataset against the Waymo Open Dataset (WOD-E2E) taxonomy, Semantic-Drive achieves a Recall of 0.966 (vs. 0.475 for CLIP) and reduces Risk Assessment Error by 40% compared to single-model baselines. Crucially, the entire pipeline runs on consumer hardware (NVIDIA RTX 3090), offering a privacy-preserving and efficient alternative to cloud-native architectures.
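The abstract does not specify how the "Judge-Scout" consensus is computed; the sketch below shows one plausible shape for such a mechanism, purely for illustration. All names (`SceneAssessment`, `judge_scout_consensus`, the agreement threshold) are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class SceneAssessment:
    # Hypothetical output of one VLM "scout": a scalar risk score
    # (e.g. 0-10) and a short scenario label for the clip.
    risk_score: float
    label: str

def judge_scout_consensus(scout_outputs, judge, agreement_threshold=2.0):
    """Toy consensus scheme (an assumption, not the authors' method):
    accept the scouts' mean risk score when they agree closely;
    otherwise defer to a stronger "judge" model for arbitration."""
    scores = [a.risk_score for a in scout_outputs]
    spread = max(scores) - min(scores)
    if spread <= agreement_threshold:
        # Scouts agree: average their scores.
        return sum(scores) / len(scores)
    # Scouts disagree: the judge model re-assesses the scene.
    return judge(scout_outputs).risk_score
```

Under such a scheme, cheap scout models handle the common case, and the more expensive judge is invoked only on disagreement, which is one way a multi-model consensus can stay efficient on consumer hardware.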
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Zhangyang_Wang1
Submission Number: 7073