Track: tiny paper (up to 4 pages)
Keywords: Mixed-Curvature Product Manifolds, Multimodal Alignment
TL;DR: ProCLIP aligns vision–language representations in a mixed-curvature product space, achieving more expressive multimodal geometry and improved image–text retrieval over single-manifold embeddings.
Abstract: Contrastive learning has become a dominant paradigm for multimodal representation learning, aligning paired observations (e.g., image–text) in a shared embedding space. Most existing approaches, however, adopt a single latent geometry, implicitly assuming that one manifold can capture the heterogeneous structure of multimodal semantics. In this work, we propose a mixed-curvature product embedding space that combines hyperbolic, Euclidean, and spherical factors within a unified latent manifold. We equip this product space with a weighted product metric and define a geometry-aware CLIP objective that measures cross-modal similarity under this metric, yielding a drop-in replacement for standard cosine-based alignment while respecting manifold constraints. We instantiate the ProCLIP framework with lightweight manifold-specific projection heads and evaluate it on standard image–text retrieval benchmarks. Empirically, mixed-curvature product embeddings consistently improve cross-modal retrieval and alignment over single-manifold baselines, and our analyses highlight regimes in which heterogeneous curvature provides the largest gains.
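The weighted product metric described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the curvature values (−1, 0, +1), the factor weights, and all function names are assumptions chosen for clarity, and the squared-distance combination follows the standard product-manifold construction.

```python
import numpy as np

def poincare_dist(u, v, eps=1e-9):
    # Geodesic distance in the Poincare ball (curvature -1); |u|, |v| < 1.
    diff = np.dot(u - v, u - v)
    denom = max((1.0 - np.dot(u, u)) * (1.0 - np.dot(v, v)), eps)
    return np.arccosh(1.0 + 2.0 * diff / denom)

def spherical_dist(u, v):
    # Geodesic distance on the unit sphere (curvature +1); u, v are unit vectors.
    return np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))

def euclidean_dist(u, v):
    # Ordinary flat-space distance (curvature 0).
    return np.linalg.norm(u - v)

def product_dist(x, y, weights=(1.0, 1.0, 1.0)):
    # Weighted product metric: d(x, y)^2 = sum_i w_i * d_i(x_i, y_i)^2,
    # where x, y are (hyperbolic, euclidean, spherical) factor tuples.
    d_h = poincare_dist(x[0], y[0])
    d_e = euclidean_dist(x[1], y[1])
    d_s = spherical_dist(x[2], y[2])
    w_h, w_e, w_s = weights
    return np.sqrt(w_h * d_h**2 + w_e * d_e**2 + w_s * d_s**2)
```

A geometry-aware contrastive objective could then use, e.g., `-product_dist(img, txt)` as the cross-modal similarity logit in place of cosine similarity, keeping the rest of the CLIP loss unchanged.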
Anonymization: This submission has been anonymized for double-blind review by removing identifying information such as names, affiliations, and URLs.
Submission Number: 115