Track: tiny paper (up to 4 pages)
Keywords: Mixed-Curvature Product Manifolds, Multimodal Alignment
TL;DR: ProCLIP aligns vision–language representations in a mixed-curvature product space, achieving more expressive multimodal geometry and improved image–text retrieval over single-manifold embeddings.
Abstract: Contrastive learning has become a dominant paradigm for multimodal representation learning, aligning paired observations (e.g., image–text) in a shared embedding space. Most existing approaches, however, adopt a single latent geometry, implicitly assuming that one manifold can capture the heterogeneous structure of multimodal semantics. In this work, we propose a mixed-curvature product embedding space that combines hyperbolic, Euclidean, and spherical factors within a unified latent manifold. We equip this product space with a weighted product metric and define a geometry-aware CLIP objective that measures cross-modal similarity under this metric, yielding a drop-in replacement for standard cosine-based alignment while respecting manifold constraints. We instantiate the ProCLIP framework with lightweight manifold-specific projection heads and evaluate it on standard image–text retrieval benchmarks. Empirically, mixed-curvature product embeddings consistently improve cross-modal retrieval and alignment over single-manifold baselines, and our analyses highlight regimes in which heterogeneous curvature provides the largest gains.
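The weighted product metric described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the curvature values (−1, 0, +1), the factor weights, and all function names are assumptions chosen for clarity, and the squared-distance combination follows the standard product-manifold construction.

```python
import numpy as np

def poincare_dist(u, v, eps=1e-9):
    # Geodesic distance in the Poincare ball (curvature -1); |u|, |v| < 1.
    diff = np.dot(u - v, u - v)
    denom = max((1.0 - np.dot(u, u)) * (1.0 - np.dot(v, v)), eps)
    return np.arccosh(1.0 + 2.0 * diff / denom)

def spherical_dist(u, v):
    # Geodesic distance on the unit sphere (curvature +1); u, v are unit vectors.
    return np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))

def euclidean_dist(u, v):
    # Ordinary flat-space distance (curvature 0).
    return np.linalg.norm(u - v)

def product_dist(x, y, weights=(1.0, 1.0, 1.0)):
    # Weighted product metric: d(x, y)^2 = sum_i w_i * d_i(x_i, y_i)^2,
    # where x, y are (hyperbolic, euclidean, spherical) factor tuples.
    d_h = poincare_dist(x[0], y[0])
    d_e = euclidean_dist(x[1], y[1])
    d_s = spherical_dist(x[2], y[2])
    w_h, w_e, w_s = weights
    return np.sqrt(w_h * d_h**2 + w_e * d_e**2 + w_s * d_s**2)
```

A geometry-aware contrastive objective could then use, e.g., `-product_dist(img, txt)` as the cross-modal similarity logit in place of cosine similarity, keeping the rest of the CLIP loss unchanged.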
Anonymization: This submission has been anonymized for double-blind review by removing identifying information such as names, affiliations, and URLs.
Submission Number: 115