Keywords: visual concept, optimal transport
Abstract: Vision-language models trained with global contrastive objectives lack explicit patch-token correspondences, limiting fine-grained compositional reasoning and concept discovery.
We propose \emph{ConceptOT}, a local alignment objective that formulates cross-modal matching as entropically regularized \emph{unbalanced} optimal transport (UOT) and solves it via a low-rank Nystr\"{o}m approximation built from learned concept anchors.
Unbalanced transport allows background patches and non-visual tokens to remain partially unmatched, while the low-rank structure reduces the per-iteration cost of generalized Sinkhorn updates from $\mathcal{O}(NM)$ to $\mathcal{O}((N{+}M)r{+}r^2)$ and provides a semantic bottleneck through which patches and tokens interact via shared concepts.
We compare seven scoring variants on COCO retrieval and SugarCrepe compositionality using a frozen CLIP ViT-B/16 backbone and lightweight trainable alignment modules within a short single-GPU budget.
ConceptOT outperforms all non-transport baselines on compositionality (80.5\% on SugarCrepe: +2.6 points over global-only and +0.9 points over FILIP), while a loss-weight ablation closes the retrieval gap (45.6 Avg R@1, matching FILIP and Dense UOT).
Qualitative analysis shows that the learned concept anchors self-organize into interpretable semantic categories, suggesting that transport-based alignment can improve compositional reasoning while exposing useful concept structure in VLMs.
Project page: \url{https://misterpawan.github.io/concept-ot-project/}.
Submission Number: 41
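
To make the complexity claim in the abstract concrete, the following is a minimal sketch of generalized (unbalanced) Sinkhorn scaling updates with a Nystr\"{o}m-factored kernel. It is not the authors' implementation. The Gaussian Gibbs kernel on squared Euclidean cost, the KL marginal penalty with weight `rho`, the numerical `floor`, and all function and parameter names are assumptions made for illustration. `X` stands in for patch features, `Y` for token features, and `Z` for the concept anchors.

```python
import numpy as np

def gibbs_kernel(X, Y, eps):
    # Gibbs kernel exp(-||x - y||^2 / eps); the squared Euclidean cost is an assumption.
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / eps)

def lowrank_uot_sinkhorn(X, Y, Z, a, b, eps=1.0, rho=1.0, n_iter=200, floor=1e-12):
    """Unbalanced Sinkhorn scalings (u, v) using K ~= A @ W_inv @ B.T.

    X: (N, d) patch features, Y: (M, d) token features, Z: (r, d) anchors.
    a, b: nonnegative marginal weights of length N and M.
    """
    A = gibbs_kernel(X, Z, eps)        # (N, r)
    B = gibbs_kernel(Y, Z, eps)        # (M, r)
    W_inv = np.linalg.pinv(gibbs_kernel(Z, Z, eps))  # (r, r), computed once

    # Each product costs O(Mr + r^2 + Nr); the N x M kernel is never formed.
    def matvec(v):
        return A @ (W_inv @ (B.T @ v))

    def rmatvec(u):
        return B @ (W_inv.T @ (A.T @ u))

    # The exponent tau < 1 is what relaxes the marginal constraints:
    # tau -> 1 (rho -> infinity) recovers balanced Sinkhorn.
    tau = rho / (rho + eps)
    u = np.ones(len(a))
    v = np.ones(len(b))
    for _ in range(n_iter):
        # The floor guards against non-positive values, which a Nystrom
        # approximation can produce because W_inv may have negative entries.
        u = (a / np.maximum(matvec(v), floor)) ** tau
        v = (b / np.maximum(rmatvec(u), floor)) ** tau
    return u, v
```

Under these assumptions the approximate transport plan is `diag(u) @ A @ W_inv @ B.T @ diag(v)`, so every patch-token interaction passes through the `r` anchors, which is the bottleneck structure the abstract describes. When the anchors coincide with the token features (`Z = Y`) and the anchor kernel is invertible, the factorization is exact and the updates coincide with dense unbalanced Sinkhorn.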