V2X-UniPool: Unifying Multimodal Perception and Knowledge Reasoning for Autonomous Driving
Keywords: Vehicle-to-Everything (V2X), Knowledge-driven Autonomous Driving, Multimodal Data Fusion, Vision Language Models (VLMs), Semantic Reasoning, Retrieval-Augmented Generation (RAG)
TL;DR: V2X-UniPool transforms multimodal V2X data into a language-based knowledge pool, enabling vehicle models to perform structured, real-time reasoning for autonomous driving via a RAG mechanism.
Abstract: Autonomous driving (AD) has achieved significant progress, yet single-vehicle perception remains constrained by sensing range and occlusions. Vehicle-to-Everything (V2X) communication addresses these limits by enabling collaboration across vehicles and infrastructure, but it also faces heterogeneity, synchronization, and latency constraints. Language models offer strong knowledge-driven reasoning and decision-making capabilities, but they are not inherently designed to process raw sensor streams and are prone to hallucination. We propose V2X-UniPool, the first framework that unifies V2X perception with language-based reasoning for knowledge-driven AD. It transforms multimodal V2X data into structured, language-based knowledge, organizes it in a time-indexed knowledge pool for temporally consistent reasoning, and employs Retrieval-Augmented Generation (RAG) to ground decisions in real-time context. Experiments on the real-world DAIR-V2X dataset show that V2X-UniPool achieves state-of-the-art planning accuracy and safety while reducing communication cost by more than 80%, achieving the lowest overhead among evaluated methods. These results highlight the promise of bridging V2X perception and language reasoning to advance scalable and trustworthy driving.
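To make the core mechanism concrete, below is a minimal sketch of a time-indexed knowledge pool with RAG-style retrieval, written in plain Python. All names here (KnowledgeEntry, TimeIndexedKnowledgePool, build_prompt) and the windowed-retrieval design are illustrative assumptions, not the paper's actual implementation: it shows only the general idea of storing language-based V2X knowledge by timestamp and retrieving a temporally consistent window to ground a language model's prompt.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class KnowledgeEntry:
    """One language-based description derived from a V2X source at a given time.
    (Hypothetical structure; the paper's knowledge schema may differ.)"""
    timestamp: float  # capture time of the underlying sensor/message data
    source: str       # e.g. "vehicle_camera", "roadside_lidar"
    text: str         # structured natural-language summary of the perception


@dataclass
class TimeIndexedKnowledgePool:
    """Stores language-based V2X knowledge sorted by timestamp for windowed retrieval."""
    _timestamps: list[float] = field(default_factory=list)
    _entries: list[KnowledgeEntry] = field(default_factory=list)

    def insert(self, entry: KnowledgeEntry) -> None:
        # Keep both lists sorted by time so lookups stay O(log n).
        i = bisect_right(self._timestamps, entry.timestamp)
        self._timestamps.insert(i, entry.timestamp)
        self._entries.insert(i, entry)

    def retrieve(self, query_time: float, window: float = 1.0) -> list[KnowledgeEntry]:
        # Return all entries within `window` seconds before the query time,
        # giving the language model temporally consistent context.
        lo = bisect_right(self._timestamps, query_time - window)
        hi = bisect_right(self._timestamps, query_time)
        return self._entries[lo:hi]


def build_prompt(query: str, context: list[KnowledgeEntry]) -> str:
    # RAG-style grounding: prepend the retrieved knowledge to the driving
    # query before sending it to a vision-language model.
    lines = [f"[{e.timestamp:.2f}s | {e.source}] {e.text}" for e in context]
    return "Context:\n" + "\n".join(lines) + f"\n\nQuestion: {query}"


if __name__ == "__main__":
    pool = TimeIndexedKnowledgePool()
    pool.insert(KnowledgeEntry(10.0, "roadside_lidar", "Pedestrian crossing 30 m ahead."))
    pool.insert(KnowledgeEntry(10.4, "vehicle_camera", "Lead vehicle braking."))
    print(build_prompt("Should the ego vehicle decelerate?", pool.retrieve(10.5)))
```

Note that transmitting short text entries like these, rather than raw sensor streams, is what plausibly accounts for the large communication savings the abstract reports; the exact retrieval and prompting strategy in V2X-UniPool may differ.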
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 1