CodeAlign: Resolving Modality Isolation in Heterogeneous Collaborative Perception

12 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: collaborative perception, deep learning, multimodal fusion, heterogeneity
Abstract: Collaborative perception leverages data exchange among multiple agents to overcome the perception limitations of individual agents, significantly enhancing overall perception capability. However, heterogeneity introduces domain gaps among agents, hindering collaboration. The heterogeneity is further compounded by an underexplored problem, modality isolation, where the absence of co-occurring data across certain modalities leads to even larger domain gaps and limits feature alignment approaches. To address this problem, we propose CodeAlign, the first framework to systematically resolve modality isolation in heterogeneous collaborative perception. The key idea is to partition modalities into groups based on whether they exhibit isolation, and to apply customized strategies for intra-group and inter-group alignment. For intra-group alignment, CodeAlign introduces code space formation, which constructs a shared discrete feature space using a codebook, enabling effective feature alignment and efficient communication. For inter-group alignment, CodeAlign introduces code space translation, which establishes mappings between code spaces, facilitating efficient and dynamic feature transfer. A lightweight Unified Code Translator performs convenient one-to-many code translation, controlled by conditional embeddings. Experiments show that CodeAlign reduces training parameters by 92\% when integrating 4 new modalities and achieves 1024× lower communication volume, while maintaining perception performance on par with SOTA methods. The code will be released.
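
To make the two alignment mechanisms concrete, below is a minimal PyTorch sketch of how a shared codebook (code space formation) and a conditionally controlled translator (code space translation) could look. All module names, dimensions, and the nearest-neighbor quantization scheme are illustrative assumptions, not the paper's actual architecture; `CodebookQuantizer` and `UnifiedCodeTranslator` are hypothetical names.

```python
# Hypothetical sketch of CodeAlign's two alignment mechanisms; all shapes,
# names, and the quantization scheme are assumptions for illustration.
import torch
import torch.nn as nn


class CodebookQuantizer(nn.Module):
    """Intra-group alignment: map continuous agent features onto a shared
    discrete code space. Agents can then transmit integer code indices
    instead of dense float features, which is one way the reduced
    communication volume could be realized."""

    def __init__(self, num_codes: int = 1024, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, feats: torch.Tensor):
        # feats: (N, dim) continuous features from any agent in the group.
        dists = torch.cdist(feats, self.codebook.weight)  # (N, num_codes)
        indices = dists.argmin(dim=-1)                    # discrete codes to transmit
        quantized = self.codebook(indices)                # features in the shared space
        return quantized, indices


class UnifiedCodeTranslator(nn.Module):
    """Inter-group alignment: translate features from a source code space
    into a target code space, with the destination selected by a conditional
    embedding so a single lightweight module can serve many target groups
    (one-to-many translation)."""

    def __init__(self, num_codes: int = 1024, dim: int = 256, num_groups: int = 4):
        super().__init__()
        self.group_embed = nn.Embedding(num_groups, dim)  # conditional embedding
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_codes)
        )

    def forward(self, src_feats: torch.Tensor, target_group: torch.Tensor):
        # src_feats: (N, dim) quantized features in the source code space;
        # target_group: (N,) indices selecting the destination code space.
        cond = self.group_embed(target_group)             # (N, dim)
        logits = self.mlp(torch.cat([src_feats, cond], dim=-1))
        return logits.argmax(dim=-1)                      # codes in the target space
```

Under this reading, adding a new modality group would only require training a small translator head and a group embedding rather than retraining all pairwise alignments, which is consistent with the reported 92\% reduction in training parameters.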
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4350