MULTIMODALITY AS SUPERVISION: SELF-SUPERVISED SPECIALIZATION TO THE TEST ENVIRONMENT VIA MULTIMODALITY

ICLR 2026 Conference Submission 9735 Authors

Published: 26 Jan 2026, Last Modified: 26 Jan 2026 · ICLR 2026 · CC BY 4.0
Keywords: specialization, multimodal, transfer learning
TL;DR: We study 'multimodality as self-supervision' to learn a representation that achieves SOTA in the test environment without using any external/Internet-based data.
Abstract: The common approach to developing a vision model is generalism: training on a large, diverse dataset that covers the varied deployment environments and yields a model expected to solve the problem everywhere. However, many practical applications operate in a specific test space, e.g., a robot deployed in a single house, and do not necessarily need to generalize to novel environments. In this work, we explore whether rich multimodal data collected only from the test environment can be used to pre-train a representation in a self-supervised way, without access to any external data. We find that this approach can match and, in most cases, outperform generalists pre-trained on large-scale Internet datasets, including popular off-the-shelf models such as CLIP and DINOv2. We study the effectiveness of this approach by evaluating the models on various datasets and downstream tasks, such as semantic segmentation, captioning, and object detection, along with a set of ablations and analyses to extract insights. This approach raises intriguing questions about substituting data with (multi)modality, enabling an alternative scenario in which the need for external Internet-scale pre-training datasets is reduced. It also shows that merely benefiting from test-space data is insufficient for achieving competitive results; multimodality is essential for that purpose.
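To make the idea of "multimodality as supervision" concrete, below is a minimal sketch of cross-modal self-supervised pre-training on test-environment data only. The abstract does not specify the paper's actual architecture, paired modality, or objective, so the RGB-to-depth prediction setup, the tiny convolutional backbone, and the L1 loss here are all illustrative assumptions, not the submission's method.

```python
# Minimal sketch: pre-train an RGB encoder on test-environment data by predicting
# a second modality (depth) captured in the same environment. The backbone,
# modality pair, and loss are hypothetical choices for illustration only.
import torch
import torch.nn as nn

class CrossModalPretrainer(nn.Module):
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # RGB encoder: the representation being specialized to the test space.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, embed_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Lightweight decoder that regresses the paired modality from RGB features.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(embed_dim, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1),
        )

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(rgb))

def pretrain_step(model, optimizer, rgb, depth):
    """One self-supervised step: the paired depth channel is the supervisory
    signal, so no human labels or external/Internet data are needed."""
    pred = model(rgb)
    loss = nn.functional.l1_loss(pred, depth)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    model = CrossModalPretrainer()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    # Stand-in for a batch of RGB-depth pairs recorded in the deployment environment.
    rgb = torch.rand(4, 3, 64, 64)
    depth = torch.rand(4, 1, 64, 64)
    print(pretrain_step(model, opt, rgb, depth))
```

After such pre-training, the encoder would typically be kept and fine-tuned or probed on downstream tasks in the same environment (e.g., segmentation or detection), which is the evaluation setting the abstract describes.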
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 9735