Multimodal vision-language models with guided cross-attention for crisis event understanding

ICLR 2026 Conference Submission 12730 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Multimodal Understanding, Vision-Language Models, Crisis Event Analysis
TL;DR: Proposes a multimodal model, CapFuse-Net, that combines VLM-generated captions with attention-guided fusion to improve crisis event understanding, yielding gains in accuracy, stability, and transferability.
Abstract: Understanding crisis events from social media posts to support response and rescue efforts requires robust multimodal reasoning over both visual and textual content. However, existing models often struggle to fully exploit the complementary nature of these modalities, particularly in noisy and information-sparse settings. In this work, we propose a novel multimodal framework, CapFuse-Net, that integrates pretrained vision-language models (VLMs) with a guided fusion strategy for improved crisis event classification. We first augment the textual input with VLM-generated, image-grounded captions, providing richer context for textual reasoning. A Cross-Feature Fusion Module (CFM) then fuses the original and generated text via cross-attention, followed by a Guided Cross-Attention module that enables fine-grained interaction between visual and textual features. To further refine this fusion, we incorporate a Differential Attention mechanism that enhances salient feature selection while suppressing noise. Extensive experiments on three crisis classification benchmarks demonstrate that our method consistently outperforms unimodal and standard multimodal baselines. An ablation study further confirms the importance of each proposed component, in particular the synergy between VLM-based captioning and attention-guided fusion. Finally, Grad-CAM visualizations provide qualitative interpretability, and additional results show robustness across diverse crisis scenarios.
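
The following is a minimal PyTorch-style sketch of the fusion pipeline as described in the abstract, not the authors' implementation: module names, feature dimensions, the use of standard multi-head cross-attention, and the specific two-branch formulation of differential attention are all assumptions made for illustration.

```python
# Hypothetical sketch of the CapFuse-Net pipeline described in the abstract:
# (1) text features attend over VLM-caption features (CFM), (2) the fused text
# attends over image features (Guided Cross-Attention), (3) a differential
# attention block refines the result before classification.
import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    """Query features attend over context features via multi-head attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        fused, _ = self.attn(query, context, context)
        return self.norm(query + fused)  # residual connection + normalization


class DifferentialAttention(nn.Module):
    """Illustrative differential attention: subtracting a second softmax map
    from the first emphasizes salient tokens and cancels shared (noisy) mass."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.h, self.d = num_heads, dim // num_heads
        self.q1, self.q2 = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.k1, self.k2 = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.lam = nn.Parameter(torch.tensor(0.5))  # learnable mixing weight (assumed)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, _ = x.shape

        def split(t: torch.Tensor) -> torch.Tensor:  # (B, N, D) -> (B, H, N, d)
            return t.view(B, N, self.h, self.d).transpose(1, 2)

        q1, q2 = split(self.q1(x)), split(self.q2(x))
        k1, k2 = split(self.k1(x)), split(self.k2(x))
        v = split(self.v(x))
        a1 = (q1 @ k1.transpose(-2, -1) / self.d ** 0.5).softmax(-1)
        a2 = (q2 @ k2.transpose(-2, -1) / self.d ** 0.5).softmax(-1)
        y = (a1 - self.lam * a2) @ v  # differential attention map applied to values
        return self.out(y.transpose(1, 2).reshape(B, N, -1))


class CapFuseNetSketch(nn.Module):
    """Caption-augmented text fusion, guided image-text cross-attention,
    differential-attention refinement, and a classification head."""

    def __init__(self, dim: int = 768, num_classes: int = 5):
        super().__init__()
        self.cfm = CrossAttentionBlock(dim)      # original text <- generated captions
        self.guided = CrossAttentionBlock(dim)   # fused text <- image patch features
        self.refine = DifferentialAttention(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, text_feats, caption_feats, image_feats):
        fused_text = self.cfm(text_feats, caption_feats)
        fused = self.guided(fused_text, image_feats)
        fused = self.refine(fused)
        return self.head(fused.mean(dim=1))  # mean-pool tokens, then classify
```

Under these assumptions, the model consumes pre-extracted token-level features (e.g., text, caption, and image patch embeddings of equal dimension) and returns class logits; the choice of encoders, pooling, and the number of crisis classes would follow the paper's experimental setup.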
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 12730