Abstract: Cooperative perception can significantly improve perception performance over single-agent systems by integrating information from multiple agents through vehicle-to-everything (V2X) communication. However, several challenges hinder high performance in cooperative perception, particularly positional errors arising from sensor data collection and time delays during data transmission. Existing research often addresses only one of these issues, making it unsuitable for scenarios where spatial and temporal errors coexist. In this paper, we focus on resolving the spatio-temporal drift caused by the interplay of spatial and temporal variations. To address this, we propose a novel end-to-end cooperative perception framework called Multi-frame Grouping Multi-agent Perception (MGMP), which effectively fuses spatio-temporal perception features from multiple agents, including vehicles and road infrastructure. Our approach extracts effective semantic information from the temporal context of multiple agents, leverages cross-learning of window information through multi-scale window attention, and groups and aggregates multiple agents to simultaneously address the spatio-temporal drift caused by positional errors and time delays. We validate the effectiveness of our method on the V2XSet, OPV2V, and DAIR-V2X datasets. Experimental results indicate that, compared to the state-of-the-art (SOTA) work, our method achieves improvements of 2.7%, 1.7%, and 1.2% on AP@0.7, respectively.
External IDs: dblp:journals/tits/DaiZWWSY26