Keywords: Disaster Image Captioning, Vision Language Models, Remote Sensing, Multimodal Deep Learning, Knowledge Graph Augmentation, Semantic Knowledge Integration
TL;DR: Knowledge-graph-enriched refinement of initial VLM captions improves caption quality
Abstract: General-purpose vision-language models (VLMs) such as LLaVA and QwenVL produce descriptions of disaster imagery that lack domain-specific vocabulary and actionable detail. We propose the Vision-Language Caption Enhancer (\ours{}), a framework that integrates external semantic knowledge from ConceptNet and WordNet into the caption generation process for post-disaster satellite and UAV imagery. \ours{} operates in two stages: first, a baseline VLM generates an initial caption conditioned on YOLOv8 object detections; second, a knowledge-enriched sequential model (either a CNN-LSTM or a hierarchical cross-modal Transformer) refines the caption using a vocabulary augmented with 1,566 domain-relevant terms extracted from knowledge graphs. We evaluate \ours{} on two disaster benchmarks: xBD (satellite, 6,369 images, 3 damage classes) and RescueNet (UAV, 4,494 images, 12 damage classes), using CLIPScore for semantic alignment and InfoMetIC for informativeness. On RescueNet with the Transformer decoder, \ours{} with knowledge graph enrichment produces captions preferred over the QwenVL baseline in 95.33\% of image pairs on InfoMetIC and 73.64\% on CLIPScore. Qualitative analysis shows that without knowledge graph integration, generated captions exhibit hallucinations, word repetition, and semantic incoherence, whereas knowledge-enriched captions maintain factual consistency and domain-appropriate vocabulary.
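For illustration, the sketch below approximates the knowledge-graph vocabulary-augmentation step described in the abstract. It is not the authors' released code: the seed terms, the use of the public ConceptNet REST API, and the NLTK WordNet interface are assumptions; the paper's 1,566-term vocabulary would result from the authors' own seed list and filtering.

```python
"""Minimal sketch: harvesting disaster-domain terms from WordNet and ConceptNet.

Illustrative only; seed terms and the edge limit are assumptions, not the
authors' configuration.
"""
import requests
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

SEED_TERMS = ["flood", "debris", "collapse", "rubble", "hurricane"]  # assumed seeds


def wordnet_neighbors(term: str) -> set[str]:
    """Collect lemmas from each synset of the seed plus its hypernyms/hyponyms."""
    related: set[str] = set()
    for syn in wn.synsets(term):
        for s in [syn] + syn.hypernyms() + syn.hyponyms():
            related.update(lemma.name().replace("_", " ") for lemma in s.lemmas())
    return related


def conceptnet_neighbors(term: str, limit: int = 50) -> set[str]:
    """Query the public ConceptNet API and keep English node labels."""
    url = f"http://api.conceptnet.io/c/en/{term}"
    edges = requests.get(url, params={"limit": limit}, timeout=10).json()["edges"]
    related: set[str] = set()
    for edge in edges:
        for node in (edge["start"], edge["end"]):
            if node.get("language") == "en":
                related.add(node["label"].lower())
    return related


def build_vocabulary(seeds: list[str]) -> set[str]:
    """Union WordNet and ConceptNet neighbors over all seed terms."""
    vocab: set[str] = set()
    for seed in seeds:
        vocab |= wordnet_neighbors(seed)
        vocab |= conceptnet_neighbors(seed)
    return vocab


if __name__ == "__main__":
    vocab = build_vocabulary(SEED_TERMS)
    print(f"{len(vocab)} candidate domain terms")
```

In a pipeline like the one described, such a term set would extend the decoder's vocabulary before the second-stage model refines the initial VLM caption.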
Submission Number: 24