Perception Through Sparsity: Fusing and Enhancing Multi-Agent Sparse Representation with Deformable Cross-Attention

15 Sept 2025 (modified: 01 May 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Autonomous Driving; Perception; Multi-Agent Perception
Abstract: Multi-agent perception has gained significant attention for its ability to share information among connected automated vehicles (CAVs) and smart infrastructure, thus mitigating occlusions and extending effective sensing range. Despite this progress, research on radar-based cooperative perception has been constrained by limited datasets: existing benchmarks provide either only partial radar views or a small number of frames, making it difficult to fully study radar's potential in V2X perception. To address this gap, we introduce V2XSet-R, the first large-scale dataset that provides complete 360-degree radar coverage from both vehicles and infrastructure, with 150k radar frames and 170k annotated 3D bounding boxes. This dataset significantly expands the scale and diversity of radar data, enabling systematic study of radar-based cooperative perception and fusion. Building on this resource, we propose SparseFusion, a dual-stage fusion framework tailored to sparse multi-agent perception. Unlike prior position-wise self-attention designs that compute affinity scores only among voxels at the same BEV location, SparseFusion aggregates cross-voxel context via a query-based deformable attention module that adaptively samples informative regions across space and agents. This design overcomes sparsity-induced degeneration, enhances feature interaction across agents, and generalizes effectively to camera BEV features, demonstrating that SparseFusion is a precise, efficient, and modality-agnostic fusion method for cooperative perception.
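To illustrate the mechanism described in the abstract, below is a minimal, self-contained PyTorch sketch of query-based deformable cross-attention over multi-agent BEV feature maps: each sparse ego query predicts sampling offsets and attention weights, gathers context from every agent's BEV map by bilinear sampling, and fuses the result. All names and hyperparameters (SparseDeformableFusion, num_points, the agent-averaging step) are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of deformable cross-attention for sparse multi-agent BEV fusion.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseDeformableFusion(nn.Module):
    """Each ego query predicts sampling offsets and per-point weights, then gathers
    context from every agent's BEV map via bilinear sampling (grid_sample)."""

    def __init__(self, dim=128, num_points=4):
        super().__init__()
        self.num_points = num_points
        self.offset_head = nn.Linear(dim, num_points * 2)  # (dx, dy) per sample point
        self.weight_head = nn.Linear(dim, num_points)       # attention weight per point
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, agent_bev):
        """
        queries:    (B, Q, C)       sparse ego queries (e.g. non-empty voxels)
        ref_points: (B, Q, 2)       normalized reference locations in [0, 1]
        agent_bev:  (B, A, C, H, W) BEV features shared by A agents
        """
        B, Q, C = queries.shape
        _, A, _, H, W = agent_bev.shape
        P = self.num_points

        # Predict per-query normalized offsets and per-point attention weights.
        offsets = self.offset_head(queries).view(B, Q, P, 2).tanh() * 0.1
        weights = self.weight_head(queries).softmax(dim=-1)            # (B, Q, P)

        # Sampling locations in [-1, 1] for grid_sample, shared across agents.
        loc = (ref_points.unsqueeze(2) + offsets).clamp(0, 1) * 2 - 1  # (B, Q, P, 2)
        loc = loc.unsqueeze(1).expand(B, A, Q, P, 2).reshape(B * A, Q, P, 2)

        # Project each agent's BEV map to values.
        values = self.value_proj(agent_bev.permute(0, 1, 3, 4, 2))     # (B, A, H, W, C)
        values = values.permute(0, 1, 4, 2, 3).reshape(B * A, C, H, W)

        # Bilinearly sample P points per query from every agent's map.
        sampled = F.grid_sample(values, loc, align_corners=False)      # (B*A, C, Q, P)
        sampled = sampled.view(B, A, C, Q, P)

        # Weight the sampled points, then average contributions over agents.
        fused = (sampled * weights.view(B, 1, 1, Q, P)).sum(-1).mean(1)  # (B, C, Q)
        return self.out_proj(fused.permute(0, 2, 1))                     # (B, Q, C)


if __name__ == "__main__":
    fusion = SparseDeformableFusion(dim=128, num_points=4)
    q = torch.randn(2, 64, 128)              # 64 sparse queries per sample
    ref = torch.rand(2, 64, 2)               # normalized BEV reference points
    bev = torch.randn(2, 3, 128, 100, 100)   # 3 agents, 100x100 BEV grid
    print(fusion(q, ref, bev).shape)         # torch.Size([2, 64, 128])
```

Because each query samples only a handful of points per agent rather than attending over the whole grid, this kind of module stays inexpensive on sparse radar BEV features while still mixing information across spatial locations and agents.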
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6367