Keywords: Vision–Language Models; CLIP; Multi-level Alignment; Fine-Grained Alignment; Long-Context Modeling
Abstract: Pioneering vision–language models such as CLIP have transformed multimodal learning by aligning images and text in a shared embedding space. However, CLIP’s training on short captions limits its ability to handle downstream tasks that require longer text comprehension and fine-grained visual grounding. Recent advances mitigate this challenge by leveraging region-proposal information to align visual regions with corresponding sentences from lengthy captions, but they incur notable deployment costs. In this paper, we introduce \textbf{MulCLIP}, a novel end-to-end multi-level alignment framework that bridges long-text structures \textbf{(long captions, sentences, words)} with image components \textbf{(global, regional)}, enabling fine-grained capabilities while surpassing CLIP’s strength in short-text understanding. MulCLIP first preserves global contrastive alignment between images and both summary and long captions, while extending positional embeddings to accommodate longer text sequences. To further enhance fine-grained understanding, we propose two novel strategies: (1) a token reconstruction alignment over locally calibrated features that strengthens semantic connections between words and image patches, and (2) a subcaption–aggregated patch alignment that automatically extracts and aggregates context-rich patches for each subcaption. Experimental results demonstrate that MulCLIP outperforms baselines in both long- and short-text understanding, while ablation studies confirm that its multi-level alignment is the key factor driving stronger fine-grained capability than region-proposal–assisted approaches.
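The sketch below is not the authors' code; it is a minimal illustration, under stated assumptions, of two mechanisms named in the abstract: stretching CLIP-style text positional embeddings to longer sequences by interpolation, and applying a symmetric CLIP-style contrastive loss at the global level to both summary and long captions. All names (`extend_positional_embeddings`, `clip_contrastive_loss`) and the target length of 248 tokens are hypothetical; the paper's fine-grained losses (token reconstruction alignment, subcaption–aggregated patch alignment) are not sketched here.

```python
# Minimal, hypothetical sketch of global multi-level contrastive alignment
# with extended text positional embeddings (not the authors' implementation).
import torch
import torch.nn.functional as F


def extend_positional_embeddings(pos_embed: torch.Tensor, new_len: int) -> torch.Tensor:
    """Stretch a (old_len, dim) positional-embedding table to (new_len, dim)
    by linear interpolation along the sequence axis."""
    # F.interpolate expects (batch, channels, length)
    stretched = F.interpolate(
        pos_embed.t().unsqueeze(0), size=new_len, mode="linear", align_corners=True
    )
    return stretched.squeeze(0).t()


def clip_contrastive_loss(img: torch.Tensor, txt: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over L2-normalized image/text embeddings of batch-aligned pairs."""
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


# Usage: global alignment of each image against both caption granularities.
B, D = 8, 512
image_emb = torch.randn(B, D)
summary_emb = torch.randn(B, D)   # short/summary caption embeddings
long_emb = torch.randn(B, D)      # long caption embeddings (extended context)
loss = clip_contrastive_loss(image_emb, summary_emb) + clip_contrastive_loss(image_emb, long_emb)

# Positional table extended from CLIP's 77 tokens to a longer (assumed) context of 248.
pos77 = torch.randn(77, D)
pos248 = extend_positional_embeddings(pos77, 248)
```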
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10277