Keywords: VLM, Cross-Attention, Multimodal Alignment, Layer-Patch-wise Attention, Model Interpretability
TL;DR: We propose a progressive cross-attention framework (CCRA) that improves vision-language alignment by integrating layer-wise and region-wise features for better performance and interpretability.
Abstract: Vision-Language Models (VLMs) struggle to coordinate diverse cross-attention mechanisms for visual-language alignment, leading to attention drift and suboptimal performance. We propose Consistent Cross-layer Regional Alignment (CCRA), which introduces Layer-Patch-Wise Cross Attention (LPWCA) to capture fine-grained regional-semantic correlations by jointly weighting patch- and layer-wise embeddings. We further introduce Progressive Attention Integration (PAI), which systematically coordinates layer-patch-wise, layer-wise, and patch-wise attention mechanisms in sequence. This progressive design ensures consistency from the semantic to the regional level while preventing attention drift and maximizing the benefits of each attention mechanism. Experimental results on ten diverse vision-language benchmarks demonstrate that CCRA-enhanced VLMs achieve state-of-the-art performance, outperforming all baseline methods with only 3.55M additional parameters, while offering enhanced interpretability through more regionally focused and semantically aligned attention patterns.
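To make the two components concrete, below is a minimal PyTorch sketch under stated assumptions, not the authors' implementation: module names, the mean-pooling used to form layer-wise and patch-wise key/value sets, and all dimensions are illustrative. LPWCA attends over a flattened (layer, patch) axis so each attention weight couples a specific layer with a specific patch; PAI then refines that output with layer-wise and patch-wise attention in sequence.

```python
import torch
import torch.nn as nn

class LPWCASketch(nn.Module):
    """Hypothetical sketch of Layer-Patch-Wise Cross Attention (LPWCA).

    Visual features from L vision-encoder layers x P patches are flattened
    into a single (layer, patch) key/value axis, so each attention weight
    jointly selects a layer AND a patch for every text-token query.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text: torch.Tensor, vis_layers: torch.Tensor) -> torch.Tensor:
        # text:       (B, T, D)     text-token queries
        # vis_layers: (B, L, P, D)  patch features from L encoder layers
        B, L, P, D = vis_layers.shape
        kv = vis_layers.reshape(B, L * P, D)  # joint (layer, patch) axis
        out, _ = self.attn(text, kv, kv)      # weights span layers and patches jointly
        return out

class PAISketch(nn.Module):
    """Hypothetical sketch of Progressive Attention Integration (PAI):
    layer-patch-wise -> layer-wise -> patch-wise attention, applied in
    sequence so the coarser stages refine the fine-grained joint stage."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.lpw = LPWCASketch(dim, num_heads)
        self.layer_wise = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.patch_wise = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text: torch.Tensor, vis_layers: torch.Tensor) -> torch.Tensor:
        x = self.lpw(text, vis_layers)                       # stage 1: joint layer-patch
        layer_feats = vis_layers.mean(dim=2)                 # (B, L, D): one token per layer
        x, _ = self.layer_wise(x, layer_feats, layer_feats)  # stage 2: layer-wise
        patch_feats = vis_layers.mean(dim=1)                 # (B, P, D): one token per patch
        x, _ = self.patch_wise(x, patch_feats, patch_feats)  # stage 3: patch-wise
        return x

# Usage with toy shapes: batch 2, 16 text tokens, 4 layers, 64 patches, dim 256.
pai = PAISketch(dim=256)
out = pai(torch.randn(2, 16, 256), torch.randn(2, 4, 64, 256))
print(out.shape)  # torch.Size([2, 16, 256])
```

The ordering mirrors the abstract's progressive design: the joint stage fixes fine-grained regional-semantic correspondences first, and the subsequent layer-wise and patch-wise stages operate on its output rather than competing with it, which is how the sequence is meant to prevent attention drift.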
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 9462