Escaping Plato’s Cave: JAM for Aligning Independently Trained Vision and Language Models

ICLR 2026 Conference Submission 14441 Authors

18 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Representation Learning, Representation Alignment, Multimodal Learning
TL;DR: We show that Platonic alignment can be explicitly optimized for: JAM post-hoc aligns frozen vision and language models through lightweight autoencoders, bridging disjoint modalities and capturing fine-grained contextual distinctions that reveal their shared structure.
Abstract: Independently trained vision and language models inhabit disjoint representational spaces, shaped by their respective modalities, learning objectives, and architectures. The Platonic Representation Hypothesis (PRH) suggests these models may nonetheless converge toward a shared statistical model of reality. This raises a fundamental question: can we move beyond post-hoc detection of such alignment and explicitly optimize for it? We argue this challenge is particularly important for tasks requiring fine-grained contextual distinctions, where multiple descriptions share global semantics but differ in subtle compositional details. We tackle this setting with the Joint Autoencoder Modulator (JAM), which aligns frozen unimodal models by jointly training modality-specific autoencoders with coordinated reconstruction and cross-modal alignment objectives. We systematically evaluate JAM across three design axes: (i) alignment objectives, introducing our multimodal Spread Loss, which outperforms classic contrastive methods; (ii) the layer depth at which alignment is most effective; and (iii) the role of foundation model scale in representational convergence. Our findings show that JAM reliably induces alignment (outperforming innately multimodal models and post-hoc alignment baselines, with absolute error reduction of up to 10% and relative error reduction of up to 80%), offering both fundamental insight into the structure of shared semantics and practical guidance for transforming generalist unimodal foundations into specialist multimodal models.
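To make the described setup concrete, the following is a minimal, hypothetical PyTorch sketch of the training objective structure the abstract outlines: frozen unimodal encoders feed lightweight modality-specific autoencoders trained with a per-modality reconstruction loss plus a cross-modal alignment term. The class and function names (LightweightAutoencoder, jam_style_loss), the bottleneck sizes, and the cosine alignment term are illustrative assumptions; in particular, the cosine term stands in for the paper's multimodal Spread Loss, whose exact form is not given here.

```python
# Hypothetical sketch of the JAM-style setup described in the abstract:
# frozen unimodal encoders, small modality-specific autoencoders, and a joint
# objective combining reconstruction with a cross-modal alignment term.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LightweightAutoencoder(nn.Module):
    """Small bottleneck autoencoder applied on top of a frozen encoder's features."""

    def __init__(self, feat_dim: int, latent_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, latent_dim), nn.GELU())
        self.decoder = nn.Linear(latent_dim, feat_dim)

    def forward(self, x):
        z = self.encoder(x)      # latent used for cross-modal alignment
        x_hat = self.decoder(z)  # reconstruction of the frozen features
        return z, x_hat


def jam_style_loss(vision_feats, text_feats, vision_ae, text_ae, align_weight=1.0):
    """Joint objective: per-modality reconstruction + cross-modal alignment.

    vision_feats / text_feats are features from *frozen* unimodal models for
    paired image-text examples (same batch index = same pair).
    """
    z_v, v_hat = vision_ae(vision_feats)
    z_t, t_hat = text_ae(text_feats)

    # Coordinated reconstruction: each autoencoder must preserve its own modality.
    recon = F.mse_loss(v_hat, vision_feats) + F.mse_loss(t_hat, text_feats)

    # Placeholder alignment term: pull paired latents together in cosine space.
    # The paper's multimodal Spread Loss would replace this term.
    align = 1.0 - F.cosine_similarity(z_v, z_t, dim=-1).mean()

    return recon + align_weight * align
```

In this sketch only the autoencoder parameters receive gradients; the vision and language backbones stay frozen, consistent with the post-hoc alignment setting the abstract describes.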
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 14441