Topological Alignment of Shared Vision-Language Embedding Space

Published: 23 Sept 2025, Last Modified: 17 Nov 2025 · UniReps 2025 · CC BY 4.0
Track: Extended Abstract Track
Keywords: Vision-Language Models, Multimodal Alignment, Topological Data Analysis
TL;DR: We present ToMCLIP, a topology-aware multilingual CLIP model that preserves cross-lingual semantic structures through topological alignment.
Abstract: Vision-Language Models (VLMs) have shown strong performance in multimodal tasks by aligning image and text representations through contrastive learning. However, their cross-modal capabilities are predominantly biased toward English due to the lack of high-quality multilingual multimodal data. Although recent multilingual extensions of VLMs have attempted to bridge this gap through knowledge distillation and continual learning, they focus on instance-level alignment and fail to preserve the global structure of the embedding space. In this paper, we propose **ToMCLIP** (**To**pological Alignment for **M**ultilingual **CLIP**), a topology-aware training framework that aligns the shared vision-language embedding space using persistent homology. To ensure scalability, we construct sparse graphs from point clouds to approximate topological features. We validate our approach, showing enhanced structural coherence of multilingual representations, higher zero-shot classification accuracy on CIFAR-100, and improved retrieval performance on xFlickr&CO.
Submission Number: 32
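
The abstract describes approximating topological features with sparse graphs built from embedding point clouds and aligning the shared space via persistent homology. Below is a minimal, illustrative sketch of that general idea, not the paper's implementation: it builds sparse k-NN graphs with scikit-learn, derives geodesic distances with SciPy, computes persistence diagrams with GUDHI, and uses a bottleneck-distance penalty as a stand-in for ToMCLIP's topological alignment objective. The function names, parameter choices (e.g., k = 10), and the bottleneck-distance penalty are assumptions for illustration only.

```python
# Sketch of topology-aware alignment between two embedding point clouds.
# Assumptions (not from the paper): sparse k-NN graphs approximate each
# point cloud's geometry, persistence diagrams are computed on graph
# geodesic distances, and the alignment penalty is a bottleneck distance.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path
import gudhi


def knn_geodesic_distances(X: np.ndarray, k: int = 10) -> np.ndarray:
    """Sparse k-NN graph -> dense geodesic (shortest-path) distance matrix."""
    graph = kneighbors_graph(X, n_neighbors=k, mode="distance")
    # Symmetrize so the graph is undirected before computing shortest paths.
    graph = 0.5 * (graph + graph.T)
    return shortest_path(graph, directed=False)


def persistence_diagram(dist: np.ndarray, max_dim: int = 1) -> np.ndarray:
    """Finite H_0/H_1 intervals of a Rips filtration on a distance matrix."""
    rips = gudhi.RipsComplex(distance_matrix=dist)
    st = rips.create_simplex_tree(max_dimension=max_dim + 1)
    st.compute_persistence()
    # Stack (birth, death) pairs across homology dimensions, drop infinite bars.
    pairs = [st.persistence_intervals_in_dimension(d) for d in range(max_dim + 1)]
    diag = np.vstack([p for p in pairs if len(p)])
    return diag[np.isfinite(diag[:, 1])]


def topological_alignment_penalty(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Bottleneck distance between the diagrams of two embedding point clouds."""
    diag_a = persistence_diagram(knn_geodesic_distances(emb_a))
    diag_b = persistence_diagram(knn_geodesic_distances(emb_b))
    return gudhi.bottleneck_distance(diag_a, diag_b)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    english = rng.normal(size=(256, 64))                   # stand-in for English text embeddings
    other = english + 0.05 * rng.normal(size=(256, 64))    # stand-in for another language
    print("topological penalty:", topological_alignment_penalty(english, other))
```

Note that this sketch only measures topological discrepancy after the fact; using such a term during training would require a differentiable surrogate over persistence diagrams, which the above does not attempt to model.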