Keywords: multimodal learning, modality expansion, text-only training, modality gap, cross-modal retrieval, representation alignment
TL;DR: TextME unifies specialized modalities without paired supervision by training text-only projectors and applying centering offsets to bridge the modality gap at inference.
Abstract: Expanding multimodal representations to novel modalities is constrained by the reliance on large-scale paired datasets (e.g., text–image, text–audio, text–3D, text–molecule), which are costly and often infeasible to collect in domains requiring expert annotation, such as medical imaging, 3D modeling, and molecular analysis. We introduce TextME, the first framework for text-only modality expansion, which removes the paired-data requirement. Our method leverages a universal geometric property of pre-trained encoders, the consistent modality gap, which enables zero-shot cross-modal transfer once embedding spaces exhibit it. We empirically verify that this property holds across audio, 3D, X-ray, and molecular domains, enabling effective cross-modal tasks without paired supervision. Furthermore, we evaluate LLM-based and multimodal text encoders to determine which is more effective as a unified anchor space. Experiments show that TextME achieves 88.2% of paired-data performance on zero-shot classification and cross-modal retrieval, while also supporting emergent capabilities between unseen modality pairs (e.g., audio-to-3D, molecule-to-image). These results highlight text-only modality expansion as a practical and scalable path toward foundation models spanning arbitrary modalities.
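The TL;DR names two ingredients: a projector trained on text only and a centering offset that bridges the modality gap at inference. The sketch below is one plausible instantiation under stated assumptions only, not the authors' released code: a ridge-regression linear projector, a mean-difference offset, and made-up dimensions, encoder placeholders, and function names (fit_projector, centering_offset, expand_to_anchor).

```python
# Minimal sketch of text-only modality expansion (illustrative assumptions, not the paper's code):
# 1) fit a projector from a specialized encoder's TEXT embeddings to an anchor text space,
# 2) at inference, shift a non-text embedding by a centering offset (text mean - modality mean)
#    estimated from unlabeled samples, then project it into the anchor space.
import numpy as np

rng = np.random.default_rng(0)
d_spec, d_anchor = 512, 768          # hypothetical dims of the specialized and anchor spaces

def fit_projector(text_spec, text_anchor, lam=1e-2):
    """Ridge-regression projector W mapping specialized-encoder text embeddings to the
    anchor text space; trained on text only (the same captions embedded by both encoders)."""
    d = text_spec.shape[1]
    A = text_spec.T @ text_spec + lam * np.eye(d)
    B = text_spec.T @ text_anchor
    return np.linalg.solve(A, B)      # shape (d_spec, d_anchor)

def centering_offset(modality_embs, text_embs):
    """Constant offset between the modality cluster and the text cluster inside the
    specialized encoder's joint space, estimated from unpaired, unlabeled samples."""
    return text_embs.mean(axis=0) - modality_embs.mean(axis=0)

def expand_to_anchor(x_emb, W, offset):
    """Inference for a non-text input: shift across the modality gap, project into the
    anchor space, and L2-normalize for cosine-similarity retrieval."""
    z = (x_emb + offset) @ W
    return z / np.linalg.norm(z)

# Random placeholders standing in for real encoder outputs.
text_spec   = rng.normal(size=(1000, d_spec))    # specialized encoder, text inputs
text_anchor = rng.normal(size=(1000, d_anchor))  # anchor text encoder, same captions
audio_spec  = rng.normal(size=(200, d_spec))     # specialized encoder, audio inputs

W = fit_projector(text_spec, text_anchor)
offset = centering_offset(audio_spec, text_spec)
query = expand_to_anchor(audio_spec[0], W, offset)
print(query.shape)  # (768,)
```

Because the offset and projector are estimated from text and unlabeled modality samples alone, no paired (modality, text) supervision is needed at any stage in this sketch.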
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 16593