CoM-V2I: Communication-Efficient Multimodal Cooperative Perception via Codebook Pruning and Multiscale Fusion
Keywords: Cooperative perception, BEV representation, Multimodality, Vector quantization
Abstract: Cooperative perception, which fuses sensory information from multiple agents to enhance each agent's perception ability, has emerged as a promising approach to overcome the limitations of single-agent line-of-sight sensing. A significant challenge, however, lies in deploying sensors across agents economically while minimizing communication costs and maintaining strong perception performance. To address this challenge, we propose CoM-V2I, a novel framework for Communication-efficient Multimodal Vehicle-to-Infrastructure (V2I) cooperative perception. In CoM-V2I, the road infrastructure is equipped with a high-resolution LiDAR sensor, while vehicles are fitted with cost-effective multi-view cameras, balancing performance with economic feasibility. We introduce a residual vector quantization-based codebook representation that improves communication efficiency by compressing bird's eye view (BEV) feature maps into lightweight indices before transmission. We also propose a codebook pruning method that reduces the codebook size by removing low-importance code vectors and merging highly similar ones, further decreasing communication costs with minimal impact on perception performance. Furthermore, we propose a multiscale fusion mechanism that progressively integrates, in a coarse-to-fine manner, the multimodal BEV feature maps from the infrastructure and vehicles, which have different spatial resolutions. Experimental results on the V2X-Real and V2X-Sim datasets demonstrate that the proposed CoM-V2I framework outperforms existing baselines in both perception accuracy and communication efficiency.
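To make the compression step concrete, the sketch below shows how residual vector quantization can turn a flattened BEV feature map into per-cell integer indices. This is a minimal illustration of the general RVQ technique named in the abstract, not the paper's implementation; the function names, codebook shapes, and PyTorch framing are assumptions.

import torch

def rvq_encode(bev, codebooks):
    """Quantize flattened BEV features (N, C) into one index per RVQ stage.

    Each stage quantizes the residual left by the previous stage, so only
    len(codebooks) small integers per BEV cell are transmitted instead of
    C floating-point channels. (Hypothetical helper, for illustration.)
    """
    residual = bev
    indices = []
    for codebook in codebooks:                    # each codebook: (K, C)
        dists = torch.cdist(residual, codebook)   # (N, K) pairwise L2 distances
        idx = dists.argmin(dim=1)                 # nearest code per cell
        indices.append(idx)
        residual = residual - codebook[idx]       # pass the residual to the next stage
    return torch.stack(indices, dim=1)            # (N, num_stages) integer payload

def rvq_decode(indices, codebooks):
    """Reconstruct BEV features by summing the selected code vectors."""
    recon = torch.zeros(indices.shape[0], codebooks[0].shape[1])
    for stage, codebook in enumerate(codebooks):
        recon = recon + codebook[indices[:, stage]]
    return recon

A (C, H, W) BEV map would be flattened to (H*W, C) before encoding; since sender and receiver share the codebooks, only the indices cross the V2I link.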
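The codebook pruning step can likewise be sketched as two operations on a trained codebook: dropping rarely selected code vectors and merging near-duplicates. The usage statistics, thresholds, and greedy single-pass merge below are illustrative assumptions, not the paper's criteria.

import torch
import torch.nn.functional as F

def prune_codebook(codebook, usage_counts, min_usage=10, sim_thresh=0.95):
    """Shrink a (K, C) codebook: drop low-importance codes, merge similar ones.

    usage_counts[k] is assumed to count how often code k was selected on
    held-out data; both thresholds are hypothetical.
    """
    # 1) Remove low-importance code vectors.
    keep = usage_counts >= min_usage
    codebook, usage_counts = codebook[keep].clone(), usage_counts[keep].clone()

    # 2) Greedy single-pass merge of high-similarity pairs: absorb code j into
    #    code i via a usage-weighted average when cosine similarity is high.
    sim = F.cosine_similarity(codebook.unsqueeze(1), codebook.unsqueeze(0), dim=-1)
    sim.fill_diagonal_(-1.0)
    alive = torch.ones(len(codebook), dtype=torch.bool)
    for i in range(len(codebook)):
        if not alive[i]:
            continue
        for j in range(i + 1, len(codebook)):
            if alive[j] and sim[i, j] > sim_thresh:
                w_i, w_j = usage_counts[i].float(), usage_counts[j].float()
                codebook[i] = (w_i * codebook[i] + w_j * codebook[j]) / (w_i + w_j)
                usage_counts[i] = usage_counts[i] + usage_counts[j]
                alive[j] = False
    return codebook[alive]

A smaller codebook means fewer bits per transmitted index, which is where the additional communication savings come from.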
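Finally, a minimal sketch of coarse-to-fine fusion across resolutions: the coarse infrastructure LiDAR BEV map is repeatedly upsampled and fused with progressively finer vehicle camera BEV maps. The module below assumes a simple concatenate-and-convolve operator per scale; the paper's actual fusion operator may differ.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFineFusion(nn.Module):
    """Fuse BEV maps of different spatial resolutions, coarsest first.
    (Hypothetical module for illustration; not the paper's architecture.)"""

    def __init__(self, channels, num_scales=3):
        super().__init__()
        # One fusion convolution per scale, mapping 2C channels back to C.
        self.fuse = nn.ModuleList(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
            for _ in range(num_scales)
        )

    def forward(self, infra_bev, vehicle_pyramid):
        # infra_bev: (B, C, h, w) coarse infrastructure LiDAR BEV map.
        # vehicle_pyramid: list of (B, C, H_s, W_s) vehicle camera BEV maps,
        # ordered coarsest to finest.
        fused = infra_bev
        for conv, veh in zip(self.fuse, vehicle_pyramid):
            # Upsample the running fused map to the next-finer vehicle scale,
            # then integrate the two modalities by concatenation + convolution.
            fused = F.interpolate(fused, size=veh.shape[-2:], mode="bilinear",
                                  align_corners=False)
            fused = conv(torch.cat([fused, veh], dim=1))
        return fused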
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 20935