Canonicalizing Multimodal Contrastive Representation Learning
Track: long paper (up to 10 pages)
Domain: machine learning
Abstract: As models and data scale, independently trained networks often induce analogous notions of similarity. Yet similarity-based measures are weaker than precise correspondence maps between distinct models. This gap is even more consequential for multimodal models, where convergence must hold not only within each modality but also for the learned image–text coupling. We therefore ask whether, given two independently trained multimodal contrastive models, there exists a single transformation that simultaneously aligns both their image and text representations. We show that the answer is yes and, moreover, that the joint transformation is a simple orthogonal map. Across contrastive multimodal families, an orthogonal map $Q \in O(d)$ fit using only a few paired samples from a single modality (e.g., images), i.e., $\tilde f(x) \approx Q f(x)$, also aligns the other modality (text), $\tilde g(y) \approx Q g(y)$. We quantify this transfer in the target text space by large gains in pointwise cosine similarity and class-level nearest neighbour accuracy after transformation. Theoretically, we show that agreement of the multimodal similarity kernel, $\langle f(x), g(y)\rangle \approx \langle \tilde f(x), \tilde g(y)\rangle$, on a small, finite set of points forces a shared orthogonal map $Q$ across modalities. Broadly, this finding enables backward-compatible model upgrades, avoiding costly re-embedding, and has implications for the privacy of learned representations.
Presenter: ~Sharut_Gupta1
Submission Number: 10
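Below is a minimal, self-contained sketch (not the paper's code) of the procedure the abstract describes: fit an orthogonal map $Q \in O(d)$ via the standard closed-form orthogonal Procrustes solution on a few paired image embeddings from the two models, then apply the same $Q$ to text embeddings. The function names and the synthetic data are illustrative assumptions standing in for two independently trained contrastive models that share a hidden rotation.

```python
import numpy as np

def fit_orthogonal_map(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """Orthogonal Procrustes: Q = argmin_{Q in O(d)} ||tgt - src @ Q.T||_F.

    src, tgt: (n, d) arrays of paired embeddings f(x_i) and tilde_f(x_i).
    The optimum is U @ Vt, where U S Vt is the SVD of the cross-covariance.
    """
    u, _, vt = np.linalg.svd(tgt.T @ src)
    return u @ vt

def mean_cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Mean pointwise cosine similarity between matched rows of a and b."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(num / den))

# --- Synthetic sanity check (assumption: model B = hidden rotation of A + noise) ---
rng = np.random.default_rng(0)
d, n_pairs, n_text = 64, 32, 200
Q_true, _ = np.linalg.qr(rng.standard_normal((d, d)))  # hidden shared orthogonal map

img_a = rng.standard_normal((n_pairs, d))                             # f(x_i)
img_b = img_a @ Q_true.T + 0.01 * rng.standard_normal((n_pairs, d))   # tilde_f(x_i)
txt_a = rng.standard_normal((n_text, d))                              # g(y_j)
txt_b = txt_a @ Q_true.T + 0.01 * rng.standard_normal((n_text, d))    # tilde_g(y_j)

Q = fit_orthogonal_map(img_a, img_b)   # fit on image pairs only
print("text cosine before:", mean_cosine(txt_a, txt_b))          # unaligned
print("text cosine after: ", mean_cosine(txt_a @ Q.T, txt_b))    # Q transfers to text
```

In this toy setup, $Q$ fit on images alone recovers the alignment of the text embeddings as well, mirroring the cross-modal transfer the abstract reports; on real models the quality of the transfer is what the paper's experiments measure.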