Topological Alignment of Shared Vision-Language Embedding Space
TL;DR: We present ToMCLIP, a topology-aware multilingual CLIP model that preserves cross-lingual semantic structures through topological alignment.
Abstract: Contrastive Vision-Language Models (VLMs) have demonstrated strong zero-shot capabilities.
However, their cross-modal alignment remains biased toward English due to limited multilingual multimodal data.
Recent multilingual extensions have alleviated this gap but enforce instance-level alignment while neglecting the global geometry of the shared embedding space.
We address this problem by introducing **ToMCLIP** (**To**pological Alignment for **M**ultilingual **CLIP**), a topology-aware framework aligning embedding spaces with topology-preserving constraints.
The proposed method applies persistent homology to define a topological alignment loss and approximates persistence diagrams, with theoretical error bounds, using a graph sparsification strategy.
We validate the proposed approach, showing enhanced structural coherence of multilingual representations, higher zero-shot accuracy on CIFAR-100, and stronger multilingual retrieval performance on xFlickr&CO.
Beyond VLMs, the proposed approach provides a general method for incorporating topological alignment into representation learning.
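To make the topological ingredients above concrete, here is a minimal sketch of 0-dimensional persistent homology and a toy alignment loss. It relies on the standard fact that the H0 death times of a point cloud's Vietoris-Rips filtration are exactly the edge weights of its minimum spanning tree; the loss shown (a 1-Wasserstein distance between sorted death times) is an illustrative stand-in, not ToMCLIP's actual loss, and the function names are hypothetical.

```python
# Hedged sketch: H0 persistence via Kruskal's MST, plus a toy alignment loss.
# All names here are illustrative; this is not the paper's implementation.
import math
from itertools import combinations

def mst_edge_weights(points):
    """H0 death times of a point cloud = MST edge weights (Kruskal).
    All H0 features are born at filtration value 0, so the diagram
    is fully described by the sorted death times returned here."""
    n = len(points)
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    parent = list(range(n))  # union-find forest with path halving
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    deaths = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:           # edge merges two components -> one H0 death
            parent[ri] = rj
            deaths.append(w)
    return sorted(deaths)

def topo_alignment_loss(embeds_a, embeds_b):
    """Toy topological alignment loss: 1-Wasserstein distance between
    the sorted H0 death times of two equally sized embedding sets."""
    da = mst_edge_weights(embeds_a)
    db = mst_edge_weights(embeds_b)
    return sum(abs(x - y) for x, y in zip(da, db))
```

In practice one would compute diagrams on mini-batch embeddings per language, and the graph sparsification mentioned in the abstract would prune the dense distance graph before the MST/persistence step to cut the quadratic edge cost.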
Submission Number: 535