Self-Disentanglement and Re-Composition for Cross-Domain Few-Shot Segmentation

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We find a natural decomposition of ViT outputs, which we use as a novel perspective to interpret the feature entanglement problem. Based on it, we further propose a method to self-disentangle ViT features and re-compose them for CD-FSS.
Abstract: Cross-Domain Few-Shot Segmentation (CD-FSS) aims to transfer knowledge from a large-scale source-domain dataset to unseen target-domain datasets with limited annotated samples. Current methods typically compare the distance between training and testing samples for mask prediction. However, we find an entanglement problem exists in this widely adopted method, which tends to bind source-domain patterns together and make each of them hard to transfer. In this paper, we aim to address this problem for the CD-FSS task. We first find a natural decomposition of the ViT structure, based on which we delve into the entanglement problem to interpret it. We find the decomposed ViT components are crossly compared between images in distance calculation, where rational comparisons are entangled with meaningless ones because they are given equal importance, leading to the entanglement problem. Based on this interpretation, we further propose to address the entanglement problem by learning to weigh all comparisons of ViT components, which learns disentangled features and re-composes them for the CD-FSS task, benefiting both generalization and finetuning. Experiments show that our model outperforms the state-of-the-art CD-FSS method by 1.92% and 1.88% in average accuracy under 1-shot and 5-shot settings, respectively.
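To make the core idea concrete, below is a minimal, hypothetical sketch (not the authors' released code) of weighting all pairwise comparisons between decomposed support-image and query-image ViT components, instead of giving every comparison equal importance. The component decomposition, tensor shapes, and the `comparison_weights` parameter name are illustrative assumptions.

```python
# Hedged sketch: learnable weighting of cross-comparisons between decomposed
# ViT feature components from a support and a query image. All names and
# shapes are assumptions for illustration, not the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class WeightedComponentComparison(nn.Module):
    """Similarity map from weighted cross-comparisons of ViT components."""

    def __init__(self, num_components: int):
        super().__init__()
        # One learnable weight per (support component, query component) pair.
        self.comparison_weights = nn.Parameter(
            torch.zeros(num_components, num_components)
        )

    def forward(self, support_feats: torch.Tensor, query_feats: torch.Tensor) -> torch.Tensor:
        # support_feats: [num_components, num_support_tokens, dim]
        # query_feats:   [num_components, num_query_tokens, dim]
        s = F.normalize(support_feats, dim=-1)
        q = F.normalize(query_feats, dim=-1)
        # Cosine similarity for every cross-comparison of components:
        # sims[i, j, m, n] compares support component i at token m
        # with query component j at token n.
        sims = torch.einsum("imd,jnd->ijmn", s, q)
        # Learned, normalized weights decide which comparisons matter,
        # rather than treating all of them as equally important.
        w = torch.softmax(self.comparison_weights.flatten(), dim=0)
        w = w.view_as(self.comparison_weights)
        # Weighted similarity map over support/query token positions.
        return torch.einsum("ij,ijmn->mn", w, sims)
```

In a CD-FSS pipeline, such a similarity map would typically be combined with the support mask to predict the query mask; in the paper, the comparison weighting is learned so that meaningful component comparisons dominate and the disentangled components can be re-composed for the target domain.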
Lay Summary: In many image tasks, like medical imaging or satellite analysis, we often face a challenge: we want to train computers to understand images in one domain (like pictures of dogs) and then apply that knowledge to a very different domain (like X-rays or crop fields), but with only a few examples. This is known as Cross-Domain Few-Shot Segmentation (CD-FSS). Most current approaches try to compare features from training and testing images directly, but we found this often mixes up different patterns from the original training set, making it hard for the model to adapt to new tasks — a problem called “feature entanglement.” To solve this, we take a closer look at how modern vision models (specifically, Vision Transformers) break down image information internally. We discovered a way to measure which parts of the image comparison are meaningful and which are just noise. By teaching the model to weigh these comparisons differently, we help it learn cleaner, more transferable features. Our method improves performance on challenging CD-FSS tasks and outperforms leading models by nearly 2% in accuracy.
Primary Area: General Machine Learning->Transfer, Multitask and Meta-learning
Keywords: Disentanglement, Composition, Few-shot, Transfer Learning
Submission Number: 8945