Keywords: Optimal Transport, Domain Adaptation, Semantic Segmentation, Label Transfer
Abstract: Vision foundation models produce features that generalize across visual domains without fine-tuning, yet naively transferring labels through these feature spaces fails under large distribution shifts.
We propose SAOT (**S**emantically **A**ware **O**ptimal **T**ransport), which learns a transport cost within a fused unbalanced optimal transport formulation for dense label transfer from frozen vision transformer features to new domains.
SAOT fuses a learnable appearance metric with semantic class-prototype priors, uses unbalanced transport for partial matching under distribution shift, and employs a block-sparse solver for tractable inference.
We pair this with a two-stage decoder: an MLP trained on SAOT pseudo-labels, then refined via EMA-teacher self-training with class-balanced sampling.
On GTA5$\to$Cityscapes with frozen DINOv2 ViT-L/14 features, SAOT+Decoder reaches 25.7\% mIoU, a **3.8$\times$** improvement over nearest-neighbor transfer (6.7\%), without any backbone adaptation.
Per-class results show large gains on spatially coherent classes (road 90.3\%, car 76.2\%, building 71.5\%), demonstrating that learned semantic transport costs capture domain-invariant structure even under severe synthetic-to-real shifts.
On VOC train$\to$val with frozen ViT-B/16 features, the full pipeline reaches 47.5\% mIoU, indicating that the approach extends beyond synthetic-to-real adaptation.
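To make the transport step described in the abstract concrete, the sketch below pairs a cosine appearance cost with a class-prototype semantic prior and solves an unbalanced Sinkhorn problem via the POT library. This is a minimal sketch, not the authors' implementation: the fusion weight `alpha`, the regularizers `reg`/`reg_m`, and the mass-based pseudo-labeling rule are illustrative assumptions, and the block-sparse solver is omitted.

```python
# A minimal sketch of fused unbalanced OT label transfer; hyperparameters
# and the learned appearance metric are stand-ins (assumptions).
import numpy as np
import ot  # Python Optimal Transport: pip install POT

def transfer_labels(src_feats, src_labels, tgt_feats, n_classes,
                    alpha=0.5, reg=0.05, reg_m=1.0):
    """src_feats: (Ns, D) L2-normalized source patch features.
    src_labels: (Ns,) integer class id per source patch (all classes present).
    tgt_feats:  (Nt, D) L2-normalized target patch features."""
    # Appearance cost: cosine distance between source and target patches.
    C_app = 1.0 - src_feats @ tgt_feats.T                        # (Ns, Nt)
    # Semantic prior: class prototypes = mean source feature per class;
    # penalize transport to targets far from the source patch's prototype.
    protos = np.stack([src_feats[src_labels == c].mean(axis=0)
                       for c in range(n_classes)])               # (K, D)
    C_sem = 1.0 - protos[src_labels] @ tgt_feats.T               # (Ns, Nt)
    C = (1.0 - alpha) * C_app + alpha * C_sem                    # fused cost
    # Unbalanced Sinkhorn: marginals are enforced only softly (reg_m),
    # so mass can be created/destroyed -- partial matching under shift.
    a = np.full(len(src_feats), 1.0 / len(src_feats))
    b = np.full(len(tgt_feats), 1.0 / len(tgt_feats))
    P = ot.unbalanced.sinkhorn_unbalanced(a, b, C, reg, reg_m)   # (Ns, Nt)
    # Pseudo-label each target patch by the class sending it the most mass.
    class_mass = np.zeros((n_classes, len(tgt_feats)))
    np.add.at(class_mass, src_labels, P)                         # scatter-add
    return class_mass.argmax(axis=0)                             # (Nt,)
```

The decoder's second stage can be sketched similarly: an EMA teacher pseudo-labels target features, confident labels supervise the student head, and class balance is approximated here with inverse-frequency loss weights as a stand-in for the class-balanced sampling named in the abstract. The threshold, momentum, and head architecture are assumptions.

```python
# A minimal sketch of EMA-teacher self-training; hyperparameters assumed.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Teacher weights track an exponential moving average of the student.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

def self_training_step(student, teacher, feats, opt, n_classes, conf_thr=0.9):
    """student/teacher: MLP heads over frozen ViT features; feats: (N, D)."""
    with torch.no_grad():
        probs = teacher(feats).softmax(dim=-1)    # (N, K) teacher predictions
        conf, pseudo = probs.max(dim=-1)
        keep = conf > conf_thr                    # keep confident pseudo-labels
    # Inverse-frequency weights so frequent classes (road, building)
    # do not dominate the loss (a proxy for class-balanced sampling).
    counts = torch.bincount(pseudo[keep], minlength=n_classes).float()
    weights = 1.0 / counts.clamp(min=1.0)
    weights = weights / weights.sum() * n_classes
    loss = F.cross_entropy(student(feats)[keep], pseudo[keep], weight=weights)
    opt.zero_grad()
    loss.backward()
    opt.step()
    ema_update(teacher, student)                  # refresh the EMA teacher
    return loss.item()
```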
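In both sketches the backbone stays frozen, mirroring the abstract's claim that no backbone adaptation is needed; only the transport cost and the lightweight MLP heads carry learned parameters.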
Submission Number: 6