MCoT: Multi-Modal Vehicle-to-Vehicle Cooperative Perception with Transformers

Shanwei Shi, Chaokun Zhang, Aojia Lv, Shen He

Published: 2023, Last Modified: 13 Nov 2024, Venue: ICPADS 2023, License: CC BY-SA 4.0

Abstract: Highly accurate perception is pivotal to the safe operation of Intelligent Connected Vehicles (ICVs). However, occlusion blind spots, limited fields of view, and the low point density of sensor data limit what a single ICV can perceive, a problem that vehicle-to-vehicle (V2V) cooperative perception addresses well. Recent developments in V2V cooperative perception technology have made the perception of ICVs increasingly accurate and reliable. In V2V cooperative perception, LiDAR and cameras are two complementary types of sensors for ICVs. However, using only a single modality, such as camera RGB images or LiDAR point clouds, for V2V cooperative perception cannot fully improve perception accuracy. Furthermore, large Transformer-based models have been shown to enhance multi-modal fusion effectively. To this end, we propose MCoT, a novel approach for multi-modal V2V cooperative perception with Transformers. MCoT extracts intermediate features from the RGB images and point clouds of different agents and aligns them in the Bird's-Eye View (BEV) perspective through rigid association. It then uses a cross-attention mechanism to perform a soft fusion of these features in the BEV domain. The attention mechanism lets the model adaptively determine which regions of the image and LiDAR data are most relevant and what information should be extracted from them. Extensive evaluations demonstrate that MCoT significantly enhances the accuracy and robustness of perception. On the large-scale simulation dataset OPV2V, our model improves average accuracy by 71.43% compared to the baseline and outperforms the second-best method by nearly 3.95%. Our approach also converges fastest given the same number of training epochs.
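The soft fusion step described in the abstract can be illustrated with a minimal sketch of single-head cross-attention over flattened BEV cells. This is not the authors' architecture: the function name `cross_attention_fuse`, the choice of LiDAR cells as queries and camera cells as keys and values, the residual connection, and the omission of learned Q/K/V projections are all assumptions made to keep the example short, and the feature values are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention_fuse(query_cells, context_cells):
    """Fuse two flattened BEV feature maps with single-head cross-attention.

    query_cells:   list of C-dim feature vectors (assumed: LiDAR BEV cells)
    context_cells: list of C-dim feature vectors (assumed: camera BEV cells)
    Learned Q/K/V projections are omitted (identity) in this sketch.
    """
    dim = len(query_cells[0])
    scale = 1.0 / math.sqrt(dim)
    fused, weights = [], []
    for q in query_cells:
        # Attention weights: how relevant each context cell is to this query.
        attn = softmax([dot(q, k) * scale for k in context_cells])
        # Weighted sum of context features (the "soft" part of the fusion).
        attended = [
            sum(a * v[i] for a, v in zip(attn, context_cells))
            for i in range(dim)
        ]
        # Residual connection keeps the original query features.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
        weights.append(attn)
    return fused, weights


# Toy 2-cell BEV grid with 2-channel features (hypothetical values).
lidar_bev = [[1.0, 0.0], [0.0, 1.0]]
camera_bev = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = cross_attention_fuse(lidar_bev, camera_bev)
```

Each row of `weights` sums to 1, so every fused cell is the query feature plus a convex combination of context features. In this toy input each LiDAR cell attends most strongly to the camera cell with the matching feature direction, which is the adaptive selection behaviour the abstract attributes to the attention mechanism.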