No Object Is an Island: Enhancing 3D Semantic Segmentation Generalization with Diffusion Models

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 (NeurIPS 2025 poster, CC BY 4.0)
Keywords: domain generalization, 3D semantic segmentation, diffusion model
Abstract: Enhancing the cross-domain generalization of 3D semantic segmentation is a pivotal task in computer vision that has recently gained increasing attention. Most existing methods, whether based on consistency regularization or cross-modal feature fusion, focus solely on individual objects while overlooking the implicit semantic dependencies among them, discarding useful semantic information. Inspired by the diffusion model's ability to flexibly compose diverse objects into high-quality images across varying domains, we seek to harness its capacity for capturing the underlying contextual distributions and spatial arrangements of objects to address the challenging task of cross-domain 3D semantic segmentation. In this paper, we propose XDiff3D, a novel cross-modal learning framework based on diffusion models that enhances the generalization of 3D semantic segmentation. XDiff3D comprises three key ingredients: (1) constructing object agent queries from diffusion features to aggregate instance-level semantic information; (2) decoupling fine-grained local details from the object agent queries to prevent interference with the 3D semantic representation; (3) leveraging the object agent queries as an interface to enhance the modeling of object semantic dependencies in 3D representations. Extensive experiments validate the effectiveness of our method, which achieves state-of-the-art performance across multiple benchmarks under different task settings. Code is available at \url{https://github.com/FanLiHub/XDiff3D}.
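To make ingredients (1) and (3) concrete, below is a minimal PyTorch sketch of how learnable object agent queries could aggregate semantics from 2D diffusion features and then serve as an interface into 3D point features. This is an illustrative reconstruction from the abstract alone, not the authors' released implementation: the class name `ObjectAgentQueries`, the attention-based design, and all dimensions (`num_queries`, `dim`, `num_heads`) are assumptions; the decoupling of fine-grained local details (ingredient 2) is omitted.

```python
import torch
import torch.nn as nn

class ObjectAgentQueries(nn.Module):
    """Hypothetical sketch of XDiff3D-style object agent queries.

    Learnable queries cross-attend to flattened 2D diffusion features to
    aggregate instance-level semantics (ingredient 1), then 3D point
    features attend back to the queries so that cross-object semantic
    dependencies flow into the 3D representation (ingredient 3).
    """

    def __init__(self, num_queries: int = 64, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # One learnable slot per putative object/instance.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        # Queries aggregate semantics from diffusion features.
        self.aggregate = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Point features query the agents to pick up object dependencies.
        self.inject = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_p = nn.LayerNorm(dim)

    def forward(self, diff_feats: torch.Tensor, point_feats: torch.Tensor) -> torch.Tensor:
        # diff_feats:  (B, H*W, dim) flattened 2D diffusion features
        # point_feats: (B, N_pts, dim) 3D backbone features
        B = diff_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        # (1) Agent queries attend over diffusion features.
        q, _ = self.aggregate(q, diff_feats, diff_feats)
        q = self.norm_q(q)
        # (3) Points attend to the agents; residual keeps local geometry.
        ctx, _ = self.inject(point_feats, q, q)
        return self.norm_p(point_feats + ctx)

# Usage with hypothetical shapes:
model = ObjectAgentQueries()
img = torch.randn(2, 1024, 256)   # batch of flattened diffusion features
pts = torch.randn(2, 4096, 256)   # batch of 3D point features
out = model(img, pts)             # -> (2, 4096, 256)
```

The residual connection in the final step reflects the abstract's concern that object-level context should augment, not overwrite, the 3D semantic representation; the actual decoupling mechanism in the paper presumably refines this further.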
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 8228