Topological Alignment of Shared Vision-Language Embedding Space

Published: 03 Feb 2026 · Last Modified: 03 Feb 2026 · AISTATS 2026 Poster · CC BY 4.0
TL;DR: We present ToMCLIP, a topology-aware multilingual CLIP model that preserves cross-lingual semantic structures through topological alignment.
Abstract: Contrastive Vision-Language Models (VLMs) have demonstrated strong zero-shot capabilities. However, their cross-modal alignment remains biased toward English due to limited multilingual multimodal data. Recent multilingual extensions have narrowed this gap, but they enforce instance-level alignment while neglecting the global geometry of the shared embedding space. We address this problem by introducing **ToMCLIP** (**To**pological Alignment for **M**ultilingual **CLIP**), a topology-aware framework that aligns embedding spaces under topology-preserving constraints. The proposed method applies persistent homology to define a topological alignment loss and approximates the persistence diagram, with theoretical error bounds, using a graph sparsification strategy. This work validates the proposed approach, showing enhanced structural coherence of multilingual representations, higher zero-shot accuracy on CIFAR-100, and stronger multilingual retrieval performance on xFlickr&CO. Beyond VLMs, the proposed approach provides a general method for incorporating topological alignment into representation learning.
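
To make the idea concrete, below is a minimal PyTorch sketch of a persistent-homology-based alignment loss. It is restricted to 0-dimensional persistent homology, whose persistence diagram for a Vietoris-Rips filtration is given exactly by the minimum-spanning-tree edge lengths of the pairwise-distance graph; the paper's graph sparsification strategy and its error bounds are not reproduced here. All function names (`topological_alignment_loss`, `zero_dim_persistence`) are illustrative, not from the paper.

```python
# Minimal sketch of a topological alignment loss between two embedding
# spaces, assuming 0-dimensional persistent homology of a Vietoris-Rips
# filtration. For dimension 0 the diagram's death times equal the edge
# lengths of a minimum spanning tree (MST) of the distance graph.
import numpy as np
import torch
from scipy.sparse.csgraph import minimum_spanning_tree


def mst_edge_indices(dist: np.ndarray):
    """Return (row, col) index arrays of the MST edges of a dense distance matrix."""
    mst = minimum_spanning_tree(dist)  # SciPy MST over the pairwise-distance graph
    return mst.nonzero()


def zero_dim_persistence(embeddings: torch.Tensor) -> torch.Tensor:
    """Sorted 0-dim persistence (death times) of a batch of embeddings.

    In a Rips filtration all 0-dim births are 0, so the diagram is fully
    described by the sorted MST edge lengths. Gradients flow through the
    gathered distances; the MST topology itself is treated as fixed.
    """
    dist = torch.cdist(embeddings, embeddings)  # pairwise Euclidean distances
    rows, cols = mst_edge_indices(dist.detach().cpu().numpy())
    rows = torch.as_tensor(rows, dtype=torch.long)
    cols = torch.as_tensor(cols, dtype=torch.long)
    return torch.sort(dist[rows, cols]).values


def topological_alignment_loss(emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
    """Penalize the gap between the 0-dim persistence diagrams of two spaces.

    Assumes equal batch sizes, so both diagrams have N-1 points. Since all
    births coincide at 0, matching sorted death times is the optimal diagram
    matching, making the L2 gap a valid Wasserstein-style diagram distance.
    """
    return torch.mean((zero_dim_persistence(emb_a) - zero_dim_persistence(emb_b)) ** 2)


if __name__ == "__main__":
    torch.manual_seed(0)
    img_emb = torch.randn(64, 512, requires_grad=True)  # e.g. image features
    txt_emb = torch.randn(64, 512, requires_grad=True)  # e.g. multilingual text features
    loss = topological_alignment_loss(img_emb, txt_emb)
    loss.backward()  # differentiable, so usable as an auxiliary training loss
    print(f"topological alignment loss: {loss.item():.4f}")
```

In practice such a term would be added to the usual contrastive objective with a weighting coefficient; the MST shortcut used here is itself an extreme form of graph sparsification, whereas the paper derives explicit approximation error bounds for its strategy.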
Submission Number: 535