JARA: Joint Alignment and Reconstruction Architecture for Region-Aware Vision-Language Pretraining

ICLR 2026 Conference Submission 5650 Authors

15 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: vision-language pretraining
TL;DR: A unified framework that integrates self-supervised region-aware learning into CLIP training.
Abstract: Contrastive Language-Image Pretraining (CLIP) shows strong zero-shot transfer capabilities. However, it fails to capture the intrinsic semantic structure within images and performs poorly on fine-grained retrieval and dense prediction. In this work, we propose the Joint Alignment and Reconstruction Architecture (JARA), a unified framework that integrates region-aware learning into CLIP via self-supervised objectives. JARA employs a Spatially Balanced Masking (SBM) strategy to uniformly decouple each image into context and masked regions. On this basis, JARA first replaces vision-to-vision self-distillation with Cross-Modal Self-Distillation (CMSD), aligning the context region's \texttt{[CLS]} token with the paired caption. Second, JARA extends multi-view learning to semantic patch reconstruction, encouraging the model to learn intrinsic associations across image regions so that region-level semantics emerge alongside contrastive training. Both objectives are optimized on the same masked view, enabling efficient single-pass training. Experiments on image-text retrieval and open-vocabulary segmentation show that JARA achieves state-of-the-art performance while remaining efficient. The code will be released after the review phase.
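To make the SBM idea concrete, below is a minimal sketch of one plausible reading of "uniformly decouple each image into context and masked regions": partition the ViT patch grid into equal local windows and mask the same fraction of patches inside each window, so masked and context patches are spread evenly over the image. This is a hypothetical illustration, not the authors' released code; the function name `spatially_balanced_mask` and the `window` and `mask_ratio` parameters are assumptions.

```python
# Hypothetical sketch of a spatially balanced masking strategy (assumed
# behavior, not the paper's implementation): mask a fixed fraction of
# patches inside every local window so that masked and context regions
# cover the image uniformly rather than clustering at random.
import torch

def spatially_balanced_mask(grid: int = 14, window: int = 2,
                            mask_ratio: float = 0.5) -> torch.Tensor:
    """Return a (grid*grid,) boolean mask; True marks masked patches."""
    assert grid % window == 0
    n_win = grid // window                  # windows per side
    per_win = window * window               # patches per window
    k = int(per_win * mask_ratio)           # masked patches per window
    mask = torch.zeros(n_win, n_win, window, window, dtype=torch.bool)
    for i in range(n_win):
        for j in range(n_win):
            idx = torch.randperm(per_win)[:k]   # uniform choice per window
            mask[i, j].view(-1)[idx] = True
    # Reassemble windows into the full (grid, grid) patch layout:
    # row = win_row * window + row_in_win, col = win_col * window + col_in_win.
    return mask.permute(0, 2, 1, 3).reshape(grid, grid).flatten()

mask = spatially_balanced_mask()            # e.g. a 14x14 grid (ViT-B/16, 224px)
context = ~mask                             # context patches kept for CMSD
```

Under this reading, the context patches (`~mask`) would feed the encoder for cross-modal self-distillation against the caption, while the masked patches would serve as reconstruction targets, both from the same masked view in a single forward pass.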
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 5650