Canonicalizing Multimodal Contrastive Representation Learning
Track: long paper (up to 10 pages)
Domain: machine learning
Abstract: As models and data scale, independently trained networks often induce analogous notions of similarity. Yet similarity-based measures are weaker than precise correspondence maps between distinct models. This gap is even more consequential for multimodal models, where convergence must hold not only within each modality but also for the learned image–text coupling. We therefore ask whether, given two independently trained multimodal contrastive models, there exists a single transformation that simultaneously aligns both their image and text representations. We show that the answer is yes and, moreover, that the joint transformation is a simple orthogonal map. Across contrastive multimodal families, an orthogonal map $Q \in O(d)$ fit using only a few paired samples from a single modality (e.g., images), i.e., $\tilde f(x) \approx Q f(x)$, also aligns the other modality (text), $\tilde g(y) \approx Q g(y)$. We quantify this transfer in the target text space by large gains in pointwise cosine similarity and class-level nearest neighbour accuracy after transformation. Theoretically, we show that agreement of the multimodal similarity kernel, $\langle f(x), g(y)\rangle \approx \langle \tilde f(x), \tilde g(y)\rangle$, on a small, finite set of points forces a shared orthogonal map $Q$ across modalities. Broadly, this finding enables backward-compatible model upgrades, avoiding costly re-embedding, and has implications for the privacy of learned representations.
Presenter: ~Sharut_Gupta1
Submission Number: 10
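Below is a minimal, self-contained sketch (not the paper's code) of the procedure the abstract describes: fit an orthogonal map $Q \in O(d)$ via the standard closed-form orthogonal Procrustes solution on a few paired image embeddings from the two models, then apply the same $Q$ to text embeddings. The function names and the synthetic data are illustrative assumptions standing in for two independently trained contrastive models that share a hidden rotation.

```python
import numpy as np

def fit_orthogonal_map(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """Orthogonal Procrustes: Q = argmin_{Q in O(d)} ||tgt - src @ Q.T||_F.

    src, tgt: (n, d) arrays of paired embeddings f(x_i) and tilde_f(x_i).
    The optimum is U @ Vt, where U S Vt is the SVD of the cross-covariance.
    """
    u, _, vt = np.linalg.svd(tgt.T @ src)
    return u @ vt

def mean_cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Mean pointwise cosine similarity between matched rows of a and b."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(num / den))

# --- Synthetic sanity check (assumption: model B = hidden rotation of A + noise) ---
rng = np.random.default_rng(0)
d, n_pairs, n_text = 64, 32, 200
Q_true, _ = np.linalg.qr(rng.standard_normal((d, d)))  # hidden shared orthogonal map

img_a = rng.standard_normal((n_pairs, d))                             # f(x_i)
img_b = img_a @ Q_true.T + 0.01 * rng.standard_normal((n_pairs, d))   # tilde_f(x_i)
txt_a = rng.standard_normal((n_text, d))                              # g(y_j)
txt_b = txt_a @ Q_true.T + 0.01 * rng.standard_normal((n_text, d))    # tilde_g(y_j)

Q = fit_orthogonal_map(img_a, img_b)   # fit on image pairs only
print("text cosine before:", mean_cosine(txt_a, txt_b))          # unaligned
print("text cosine after: ", mean_cosine(txt_a @ Q.T, txt_b))    # Q transfers to text
```

In this toy setup, $Q$ fit on images alone recovers the alignment of the text embeddings as well, mirroring the cross-modal transfer the abstract reports; on real models the quality of the transfer is what the paper's experiments measure.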