HERA: Efficient Test-Time Adaptation for Cross-Domain Few-Shot Segmentation with Vision Foundation Models
Keywords: Test-Time Adaptation; Cross-Domain Few-Shot Segmentation; Vision Foundation Models; Parameter-Efficient Fine-tuning
TL;DR: HERA: a source-free test-time adaptation framework that turns a few labeled supports into reliable guidance for VFMs in CD-FSS.
Abstract: Vision foundation models (VFMs) excel across vision tasks, but applying them to Cross-Domain Few-Shot Segmentation (CD-FSS) faces two key obstacles: (i) pronounced domain shift that misaligns support–query correspondence, and (ii) few-shot supervision that precludes source-data retraining. Existing frozen-backbone adaptation rarely treats matching risk as a first-class objective, leaving support–query alignment fragile. We introduce Hierarchical Episode-wise Risk Alignment (HERA), a unified VFM-based principle that contracts alignment risk top-down, across layers, attention, and pixels, under a frozen backbone, thereby reducing support–query mismatch. Concretely, Hierarchical Layer Routing (HLR) routes each episode to its optimal layer to stabilize semantics; Gaussian-Guided Attention (GGA) calibrates self-attention with entropy-gated Gaussian priors, strengthening locality while preserving global coverage; and Pixelwise Adaptive Reweighting (PAR) reweights per-pixel logits with lightweight residuals to recover thin structures and denoise low-contrast regions. Together, these modules form a top-down risk-contraction path that unlocks ViT capacity for hierarchical semantics, structured locality, and fine-grained discrimination. By default, HERA is instantiated on DINOv3 and generalizes across ViT backbones. In extensive evaluations, HERA surpasses the state of the art (+6.51%) without source data or end-to-end retraining, yielding a lightweight, deployable recipe for leveraging VFMs in CD-FSS.
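To make the GGA idea described in the abstract concrete, below is a minimal sketch of entropy-gated Gaussian calibration of ViT self-attention: queries whose attention is diffuse (high entropy) receive a stronger Gaussian locality bias over the patch grid, while confident queries keep their global pattern. The function name, shapes, and the exact gating form are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of entropy-gated Gaussian attention calibration.
# Names, shapes, and the gating formula are assumptions for exposition only.
import math
import torch

def gaussian_guided_attention(q, k, v, grid_hw, sigma=2.0):
    """q, k, v: (B, heads, N, d) patch-token features on an h x w grid."""
    B, H, N, d = q.shape
    h, w = grid_hw
    assert h * w == N, "tokens must form an h x w grid"

    # Standard scaled dot-product attention logits.
    logits = q @ k.transpose(-2, -1) / d ** 0.5                          # (B, H, N, N)

    # 2D Gaussian locality prior over patch-grid distances (log-space bias).
    ys, xs = torch.meshgrid(
        torch.arange(h, device=q.device),
        torch.arange(w, device=q.device),
        indexing="ij",
    )
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()   # (N, 2)
    dist2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)     # (N, N)
    prior = -dist2 / (2 * sigma ** 2)

    # Entropy gate: diffuse (high-entropy) queries get more locality prior.
    attn = logits.softmax(dim=-1)
    entropy = -(attn * attn.clamp_min(1e-8).log()).sum(-1, keepdim=True)  # (B, H, N, 1)
    gate = entropy / math.log(N)                                          # normalized to [0, 1]

    calibrated = (logits + gate * prior).softmax(dim=-1)
    return calibrated @ v                                                 # (B, H, N, d)
```

In this sketch the backbone stays frozen; only the additive bias on the attention logits changes, which is one way a locality prior can be injected without retraining the ViT.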
Supplementary Material: pdf
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 9522