SUMMR: Self-supervised Joint Representation Learning for Symmetric Multimodal Retrieval

15 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Self-supervised Learning, Symmetric Multimodal Retrieval
Abstract: Existing work on multimodality-to-multimodality (MM2MM) retrieval mainly focuses on asymmetric retrieval, where text-image pairs in the query and the context serve distinct roles. In this work, we address the critical yet underexplored challenge of symmetric retrieval, where queries and contexts are interchangeable. We propose SUMMR, a novel two-stage self-supervised framework that leverages unlabeled web-scale image-text pairs, in contrast to previous methods that rely heavily on costly supervised data. Based on the observation that both semantic alignment and discrepancies exist between the two modalities, we first learn a mask that disentangles shared and unique information within each image-text pair, allowing us to align the shared concepts while preserving modality-specific details. We then leverage this mask to automatically generate positive and negative samples for self-supervised contrastive learning of the final joint embedding. Complementing this framework, we introduce a novel benchmark featuring high-quality human-annotated positive and hard-negative pairs to evaluate symmetric MM2MM retrieval. On this benchmark, extensive experiments against ten SOTA methods show that SUMMR surpasses the strongest supervised VLM by 3.42 points, with over 50x fewer model parameters and a 5x smaller embedding dimension. Code will be available upon publication.
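Since the code is not yet released, the following is a minimal, hypothetical PyTorch sketch of the two-stage objective described in the abstract. All names, tensor shapes, and loss weights here are assumptions for illustration, not the authors' implementation: stage one learns a soft mask separating shared from modality-specific components, and stage two trains a joint embedding with a symmetric InfoNCE loss whose positives would, in SUMMR, be generated via that mask.

```python
# Hypothetical sketch of a SUMMR-style two-stage objective.
# Assumed (not from the paper): unit-normalized (B, D) embeddings,
# a sigmoid soft mask, and a 0.1 sparsity weight.
import torch
import torch.nn.functional as F


def stage1_mask_alignment(img_emb, txt_emb, mask_logits):
    """Align the masked (shared) components across modalities."""
    mask = torch.sigmoid(mask_logits)          # soft shared-info mask in [0, 1]
    shared_i = F.normalize(mask * img_emb, dim=-1)
    shared_t = F.normalize(mask * txt_emb, dim=-1)
    # Pull the shared components of each pair together (cosine alignment),
    # leaving the unmasked dimensions free to keep modality-specific detail.
    align = 1.0 - (shared_i * shared_t).sum(-1).mean()
    # Keep the mask selective rather than degenerating to all-ones
    # (0.1 is an assumed regularization weight).
    sparsity = mask.mean()
    return align + 0.1 * sparsity


def stage2_contrastive(joint_q, joint_c, temperature=0.07):
    """Symmetric InfoNCE over joint query/context embeddings.

    Positives sit on the diagonal; in SUMMR they would be generated
    automatically via the stage-1 mask (approximated here by paired rows).
    """
    q = F.normalize(joint_q, dim=-1)
    c = F.normalize(joint_c, dim=-1)
    logits = q @ c.t() / temperature
    labels = torch.arange(q.size(0), device=q.device)
    # Averaging both directions reflects that queries and contexts
    # are interchangeable in symmetric retrieval.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))


if __name__ == "__main__":
    B, D = 8, 256
    img = F.normalize(torch.randn(B, D), dim=-1)
    txt = F.normalize(torch.randn(B, D), dim=-1)
    mask_logits = torch.randn(B, D, requires_grad=True)
    print(stage1_mask_alignment(img, txt, mask_logits))
    print(stage2_contrastive(img, txt))
```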
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 5513