Multimodal Few-Shot Point Cloud Segmentation via Agent Adaptation and Discriminative Deconfusion

ICLR 2026 Conference Submission 8297 Authors

17 Sept 2025 (modified: 03 Dec 2025) · ICLR 2026 Conference Submission · Readers: Everyone · License: CC BY 4.0
Keywords: Few-Shot 3D Point Cloud Segmentation; Multimodal Data; Semantic Agents Correlation Aggregation; Discriminative Deconfusion; Semantic Agents Prototypes Adaptation
Abstract: Few-shot 3D point cloud segmentation (FS-PCS) aims to segment novel categories from only a limited amount of annotated data. Most existing studies rely on single-modal point cloud data and have not fully exploited the potential of multimodal information. In this paper, we propose a novel FS-PCS framework, Multimodal Agent Adaptation and Discriminative Deconfusion (MAD), which incorporates three modalities: images, point clouds, and category text embeddings. To fuse multimodal information, we propose the Multimodal Semantic Agents Correlation Aggregation (M-SACA) module, which aggregates multimodal features through agent-level correlation and uses text affinity to guide category-level semantic learning. To alleviate the semantic gap between support-set and query-set multimodal features, we propose the Semantic Agents Prototypes Adaptation (SAPA) module, which generates multimodal agents for the query and support sets and adjusts the prototypes to fit the query feature space. To alleviate intra-class confusion, we introduce the Discriminative Deconfusion (DD) module, which preserves intra-class consistency through residual adapters and generator weights. Experiments on the S3DIS and ScanNet datasets demonstrate that MAD attains state-of-the-art performance, improving mIoU by 3%–7%. Our method significantly improves segmentation results and offers valuable insights for future studies. The code will be publicly available.
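To make the agent-based fusion described in the abstract more concrete, below is a minimal, hypothetical PyTorch sketch in the spirit of the M-SACA idea: a small set of learnable semantic agents attends to point and image features, the modality-specific agents are fused and reweighted by their agent-level correlation, and per-point category scores are read out from cosine affinity to the category text embeddings. The class name `AgentCorrelationFusion`, all tensor shapes, and all hyperparameters are assumptions for illustration only; since the paper's code is not yet public, this is not the authors' implementation.

```python
# Hypothetical sketch of agent-level multimodal fusion with text affinity.
# All names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AgentCorrelationFusion(nn.Module):
    """Compress point and image features into a few semantic agents, fuse them
    via agent-level correlation, and score categories by text affinity."""

    def __init__(self, dim=256, num_agents=8, num_heads=4):
        super().__init__()
        # Learnable agent tokens shared across modalities (assumed design choice).
        self.agents = nn.Parameter(torch.randn(num_agents, dim))
        self.point_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, point_feat, image_feat, text_emb):
        # point_feat: (B, N, C) per-point features; image_feat: (B, M, C) pixel/patch
        # features; text_emb: (K, C) category text embeddings (e.g., from a text encoder).
        B = point_feat.size(0)
        agents = self.agents.unsqueeze(0).expand(B, -1, -1)                  # (B, A, C)
        # Agents attend to each modality separately.
        pt_agents, _ = self.point_attn(agents, point_feat, point_feat)       # (B, A, C)
        im_agents, _ = self.image_attn(agents, image_feat, image_feat)       # (B, A, C)
        # Agent-level correlation between the two modality-specific agent sets.
        corr = F.cosine_similarity(pt_agents, im_agents, dim=-1)             # (B, A)
        fused_agents = self.fuse(torch.cat([pt_agents, im_agents], dim=-1))  # (B, A, C)
        fused_agents = fused_agents * corr.unsqueeze(-1)                     # reweight by correlation
        # Broadcast fused agent information back to points via similarity-weighted sum.
        attn = torch.softmax(point_feat @ fused_agents.transpose(1, 2), dim=-1)  # (B, N, A)
        point_out = point_feat + attn @ fused_agents                         # (B, N, C)
        # Text affinity: per-point category logits from cosine similarity to text embeddings.
        logits = F.normalize(point_out, dim=-1) @ F.normalize(text_emb, dim=-1).t()  # (B, N, K)
        return point_out, logits
```

In this reading, the correlation term acts as a per-agent gate, so agents whose point-view and image-view representations disagree contribute less to the fused point features, while the text-affinity logits supply the category-level semantic signal mentioned in the abstract.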
Supplementary Material: pdf
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8297