Multi-modal Learning via Slot-Guided Fine-grained Alignment with Pre-trained Uni-modal Models

ICLR 2026 Conference Submission 16720 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: multi-modal learning; slot attention
TL;DR: We propose a slot-guided alignment framework that integrates diverse pretrained uni-modal models to facilitate multi-modal learning.
Abstract: Learning multi-modal representations with cross-modal correspondence often relies on high-quality multi-modal datasets annotated with correspondence information. Preparing multi-modal datasets is costly, let alone obtaining the correspondence information. Recently, many pretrained uni-modal models trained on massive data have become available, each capturing its own set of concepts through representation learning. Our idea for addressing the multi-modal data scarcity challenge is to align a multi-modal model with uni-modal models using fine-grained cross-modal correspondence. To this end, we propose a multi-modal learning framework called slot-guided alignment (SGA), which utilizes slot attention to decompose both the multi-modal and uni-modal representations into disentangled slots. The slots obtained from the pretrained uni-modal models help the associated concepts to be better aligned. As slot attention can be applied to diverse model architectures, a wide range of pretrained models can be leveraged. In addition, the disentangled slots from each modality allow similarity to be measured among them, which in turn allows cross-modal correspondence to be established at the slot level and enables the pretrained uni-modal models to contribute to multi-modal representation learning in a fine-grained manner. To demonstrate the effectiveness of the SGA framework, we conduct experiments on visual-text datasets for retrieval and visual question answering, and on visual-audio datasets for classification. We use SGA to enhance three baselines; the results show significant improvements over the vanilla baselines, and competitive results can be achieved even with a much smaller training dataset.
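The sketch below illustrates the two ingredients the abstract describes: slot attention that decomposes token features from each modality into disentangled slots, and a slot-level similarity that establishes fine-grained cross-modal correspondence. It is a minimal PyTorch sketch based only on the abstract; the module and function names (e.g. `SlotAttention`, `slot_alignment_loss`) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of slot-guided alignment (SGA) as described in the abstract.
# Slot attention decomposes token features from each modality into slots; a
# slot-level similarity then matches slots across modalities for alignment.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SlotAttention(nn.Module):
    """Standard slot-attention module (Locatello et al., 2020) over a set of tokens."""

    def __init__(self, num_slots: int, dim: int, iters: int = 3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_mu = nn.Parameter(torch.randn(1, num_slots, dim))
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in, self.norm_slots = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) -> slots: (batch, num_slots, dim)
        b, _, d = tokens.shape
        tokens = self.norm_in(tokens)
        k, v = self.to_k(tokens), self.to_v(tokens)
        slots = self.slots_mu.expand(b, -1, -1)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            # softmax over slots makes slots compete for tokens
            attn = torch.softmax(torch.einsum("bsd,btd->bst", q, k) * self.scale, dim=1)
            attn = attn / attn.sum(dim=-1, keepdim=True)  # weighted mean over tokens
            updates = torch.einsum("bst,btd->bsd", attn, v)
            slots = self.gru(updates.reshape(-1, d), slots.reshape(-1, d)).view(b, -1, d)
        return slots


def slot_alignment_loss(slots_a: torch.Tensor, slots_b: torch.Tensor) -> torch.Tensor:
    """Fine-grained alignment: match each slot in modality A to its most similar
    slot in modality B (cosine similarity) and maximize that similarity."""
    a = F.normalize(slots_a, dim=-1)              # (batch, S_a, dim)
    b = F.normalize(slots_b, dim=-1)              # (batch, S_b, dim)
    sim = torch.einsum("bsd,btd->bst", a, b)      # slot-to-slot similarity matrix
    return (1.0 - sim.max(dim=-1).values).mean()  # best-match alignment objective


# Usage sketch: decompose image and text token features into slots, then align them.
if __name__ == "__main__":
    img_tokens = torch.randn(4, 49, 256)  # e.g. features from a pretrained vision model
    txt_tokens = torch.randn(4, 20, 256)  # e.g. features from a pretrained language model
    slot_img, slot_txt = SlotAttention(8, 256), SlotAttention(8, 256)
    loss = slot_alignment_loss(slot_img(img_tokens), slot_txt(txt_tokens))
    loss.backward()
```

In this sketch the alignment loss uses a simple best-match cosine objective between slot sets; the paper's actual slot-level correspondence and how it incorporates the pretrained uni-modal slots may differ.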
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 16720