ConceptOT: Fine-Grained Vision-Language Alignment via Low-Rank Unbalanced Optimal Transport

Published: 24 Apr 2026, Last Modified: 05 Jun 2026, Venue: VisCon 2026 Poster, License: CC BY 4.0
Keywords: visual concept, optimal transport
Abstract: Vision-language models trained with global contrastive objectives lack explicit patch-token correspondences, limiting fine-grained compositional reasoning and concept discovery. We propose \emph{ConceptOT}, a local alignment objective that formulates cross-modal matching as entropically regularized \emph{unbalanced} optimal transport (UOT) and solves it via a low-rank Nystr\"{o}m approximation built from learned concept anchors. Unbalanced transport allows background patches and non-visual tokens to remain partially unmatched, while the low-rank structure reduces the per-iteration cost of generalized Sinkhorn updates from $\mathcal{O}(NM)$ to $\mathcal{O}((N{+}M)r{+}r^2)$ and provides a semantic bottleneck through which patches and tokens interact via shared concepts. We compare seven scoring variants on COCO retrieval and SugarCrepe compositionality using a frozen CLIP ViT-B/16 backbone and lightweight trainable alignment modules within a short single-GPU budget. ConceptOT outperforms all non-transport baselines on compositionality (80.5\% SugarCrepe, +2.6 over global-only, +0.9 over FILIP), while a loss-weight ablation closes the retrieval gap (45.6 Avg R@1, matching FILIP and Dense UOT). Qualitative analysis shows that the learned concept anchors self-organize into interpretable semantic categories, suggesting that transport-based alignment can improve compositional reasoning while exposing useful concept structure in VLMs. Project page: \url{https://misterpawan.github.io/concept-ot-project/}.
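To make the complexity claim concrete, here is a minimal NumPy sketch of generalized (unbalanced) Sinkhorn scaling in which the Gibbs kernel is applied only through Nyström factors built from a set of anchors. It is an illustration under stated assumptions, not the paper's implementation: the squared-Euclidean cost, the KL marginal penalty with weight `rho`, the ridge term, the positivity clamp, and all function and variable names are hypothetical choices made for this example. In the paper the anchors are learned; here they are random placeholders.

```python
import numpy as np

def gibbs_kernel(X, Y, eps):
    # Assumed cost: squared Euclidean distance; kernel is exp(-cost / eps).
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / eps)

def lowrank_uot_sinkhorn(X, Y, anchors, a, b, eps=1.0, rho=1.0,
                         n_iters=50, ridge=1e-6):
    """Unbalanced Sinkhorn scalings (u, v) with a Nystrom-factored kernel.

    The N x M kernel is never formed. It is approximated as
    K_xc @ W @ K_cy with W = (K_cc + ridge * I)^{-1}, so each kernel
    application costs O((N + M) r + r^2) for r anchors.
    """
    r = anchors.shape[0]
    K_xc = gibbs_kernel(X, anchors, eps)           # N x r
    K_cy = gibbs_kernel(anchors, Y, eps)           # r x M
    K_cc = gibbs_kernel(anchors, anchors, eps)     # r x r
    W = np.linalg.inv(K_cc + ridge * np.eye(r))    # one-off O(r^3) setup

    tiny = 1e-12  # Nystrom products can be non-positive; clamp (assumption)

    def apply_K(v):
        return np.maximum(K_xc @ (W @ (K_cy @ v)), tiny)

    def apply_KT(u):
        return np.maximum(K_cy.T @ (W.T @ (K_xc.T @ u)), tiny)

    # KL-penalized marginals soften the usual Sinkhorn update by this exponent.
    power = rho / (rho + eps)
    u = np.ones(X.shape[0])
    v = np.ones(Y.shape[0])
    for _ in range(n_iters):
        u = (a / apply_K(v)) ** power
        v = (b / apply_KT(u)) ** power
    return u, v, (K_xc, W, K_cy)

# Toy usage with random stand-ins for patch and token embeddings.
rng = np.random.default_rng(0)
N, M, r, d = 6, 5, 3, 4
X = 0.5 * rng.normal(size=(N, d))         # "patches"
Y = 0.5 * rng.normal(size=(M, d))         # "tokens"
anchors = 0.5 * rng.normal(size=(r, d))   # placeholder for learned anchors
a = np.full(N, 1.0 / N)
b = np.full(M, 1.0 / M)
u, v, (K_xc, W, K_cy) = lowrank_uot_sinkhorn(X, Y, anchors, a, b)
```

The approximate plan is `diag(u) @ (K_xc @ W @ K_cy) @ diag(v)`, but it never needs to be materialized: its marginals and any score derived from it can be computed through the same three factors. As `rho` grows, the exponent approaches 1 and the updates recover balanced Sinkhorn; smaller `rho` lets mass on background patches or non-visual tokens go unmatched.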
Submission Number: 41