Think as People: Context-Driven Multi-Image News Captioning with Adaptive Dual Attention

Published: 01 Jan 2024, Last Modified: 21 Feb 2025, ICASSP 2024, CC BY-SA 4.0
Abstract: Automatic image captioning has been studied extensively, but existing methods focus primarily on a single image. Demand is growing for captioning multiple images together with their contextual information in diverse scenarios, e.g., composing news article headlines and electronic medical reports. In this paper, we propose a novel COntext-driven captioning approach for Multi-Image News, called COMIN, which employs a two-step attention mechanism, called adaptive dual attention, comprising global attention for grasping the overall context and local attention for capturing finer image details. The design is inspired by human observation and cognitive processes, in which global attention is responsible for understanding high-level features and local attention for detailing low-level features. Experimental results on our newly contributed Star-News dataset show that the proposed model outperforms state-of-the-art image captioning methods in multi-image captioning scenarios.
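To make the two-step idea in the abstract concrete, the following is a minimal illustrative sketch, not the COMIN implementation. The abstract does not specify the scoring functions, the feature shapes, or what makes the attention "adaptive", so everything here is an assumption: dot-product scoring, mean-pooled image summaries, and the hypothetical names `dual_attention`, `context`, and `images`. A global step weights whole images against a context vector, and a local step attends over regions inside each image.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Sum of vectors scaled by their weights."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def dual_attention(context, images):
    """Hypothetical two-step attention (assumed design, not from the paper).

    context: one feature vector, e.g. an encoding of the news text.
    images:  list of images, each a list of region feature vectors.
    Returns (global_weights, fused_vector).
    """
    # Global step: summarize each image by mean pooling, then score the
    # summaries against the context to get one weight per image.
    summaries = [
        weighted_sum([1.0 / len(regions)] * len(regions), regions)
        for regions in images
    ]
    global_weights = softmax([dot(context, s) for s in summaries])

    # Local step: within each image, attend over its regions.
    local_vectors = []
    for regions in images:
        local_weights = softmax([dot(context, r) for r in regions])
        local_vectors.append(weighted_sum(local_weights, regions))

    # Fuse the per-image local vectors using the global weights.
    fused = weighted_sum(global_weights, local_vectors)
    return global_weights, fused


# Toy usage: two images with two regions each, 2-dimensional features.
context = [1.0, 0.0]
images = [
    [[2.0, 0.0], [1.0, 0.0]],  # regions aligned with the context
    [[0.0, 1.0], [0.0, 2.0]],  # regions orthogonal to the context
]
global_weights, fused = dual_attention(context, images)
```

In this toy input, the first image receives the larger global weight because its regions align with the context vector. A real model would learn projection matrices for the scores and feed the fused vector to a caption decoder; both are omitted here.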